Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread DB Tsai
+1

On Apr 29, 2024, at 8:01 PM, Wenchen Fan wrote:

To add more color: Spark data source tables and Hive Serde tables are both stored in the Hive metastore and keep their data files in the table directory. The only difference is that they have different "table providers", which means Spark will use different readers/writers. Ideally the Spark native data source reader/writer is faster than the Hive Serde one.

What's more, the default format of Hive Serde is text. I don't think people want to use text-format tables in production. Most people will add `STORED AS parquet` or `USING parquet` explicitly. By setting this config to false, we get a more reasonable default behavior: creating Parquet tables (or whatever is specified by `spark.sql.sources.default`).

On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan wrote:

@Mich Talebzadeh there seems to be a misunderstanding here. The Spark native data source table is still stored in the Hive metastore; it's just that Spark will use a different (and faster) reader/writer for it. `hive-site.xml` should work as it does today.

On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon wrote:

+1

It's a legacy conf that we should eventually remove. Spark should create Spark tables by default, not Hive tables.

Mich, for your workload, you can simply switch that conf off if it concerns you. We also enabled ANSI as well (which you agreed on). It's a bit awkward to stop in the middle for this compatibility reason while making Spark sound. The compatibility has been tested in production for a long time, so I don't see any particular issue with the compatibility case you mentioned.

On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh wrote:

Hi @Wenchen Fan,

Thanks for your response. I believe we have not had enough time to "DISCUSS" this matter. Currently, in order to make Spark take advantage of Hive, I create a soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and Hive is 3.1.1:

/opt/spark/conf/hive-site.xml -> /data6/hduser/hive-3.1.1/conf/hive-site.xml

This works fine for me in my lab. So in the future, if we opt to set "spark.sql.legacy.createHiveTableByDefault" to false, there will not be a need for this logical link? On the face of it, this looks fine, but in real life it may require a number of changes to old scripts. Hence my concern. As a matter of interest, has anyone liaised with the Hive team to ensure they have introduced the additional changes you outlined?

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom

   view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, to quote Werner von Braun, "one test result is worth one thousand expert opinions".

On Sun, 28 Apr 2024 at 09:34, Wenchen Fan wrote:

@Mich Talebzadeh thanks for sharing your concern!

Note: creating Spark native data source tables is usually Hive-compatible as well, unless we use features that Hive does not support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to create a Spark native table in this case, instead of creating a Hive table and failing.

On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan wrote:

+1 (non-binding)

Thanks,
Cheng Pan
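
For readers skimming the thread, here is a minimal sketch of the behavior change under vote. It assumes a local, Hive-enabled SparkSession and uses illustrative table names; the two configs shown (`spark.sql.legacy.createHiveTableByDefault` and `spark.sql.sources.default`) are the ones discussed above.

  import org.apache.spark.sql.SparkSession

  object CreateTableDefaultSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[*]")
        .enableHiveSupport() // requires a Hive-enabled Spark build
        .getOrCreate()

      // Legacy behavior: CREATE TABLE without USING/STORED AS yields a Hive SerDe table (text format).
      spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
      spark.sql("CREATE TABLE legacy_tbl (id INT, name STRING)")

      // Proposed default: the same statement creates a Spark native data source table,
      // using the format from spark.sql.sources.default (Parquet unless overridden).
      spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
      spark.sql("CREATE TABLE native_tbl (id INT, name STRING)")

      // Both tables live in the Hive metastore; only the table provider differs.
      spark.sql("DESCRIBE EXTENDED legacy_tbl").show(truncate = false)
      spark.sql("DESCRIBE EXTENDED native_tbl").show(truncate = false)
    }
  }

Either way, an explicit `USING parquet` or `STORED AS parquet` clause still picks the provider directly, independent of the config.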

On Sat, Apr 27, 2024 at 9:29 AM Holden Karau  wrote:
>
> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh  wrote:
>>
>> +1
>>
>> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun  wrote:
>> >
>> > I'll start with my +1.
>> >
>> > Dongjoon.
>> >
>> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>> > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault
>> > > to `false` by default. The technical scope is defined in the following PR.
>> > >
>> > > - DISCUSSION:
>> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>> > > - PR: https://github.com/apache/spark/pull/46207
>> > >
>> > > The vote is open until April 30th 1AM (PST) and passes
>> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> > >
>> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default
>> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ...
>> > >
>> > > Thank you in advance.
>> > >
>> > > Dongjoon
>> > >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-16 Thread DB Tsai
+1

Sent from my iPhone

On Apr 16, 2024, at 3:11 PM, bo yang wrote: +1

On Tue, Apr 16, 2024 at 1:38 PM Hyukjin Kwon wrote: +1

On Wed, Apr 17, 2024 at 3:57 AM L. C. Hsieh wrote: +1

On Tue, Apr 16, 2024 at 4:08 AM Wenchen Fan  wrote:
>
> +1
>
> On Mon, Apr 15, 2024 at 12:31 PM Dongjoon Hyun  wrote:
>>
>> I'll start with my +1.
>>
>> - Checked checksum and signature
>> - Checked Scala/Java/R/Python/SQL Document's Spark version
>> - Checked published Maven artifacts
>> - All CIs passed.
>>
>> Thanks,
>> Dongjoon.
>>
>> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 3.4.3.
>> >
>> > The vote is open until April 18th 1AM (PDT) and passes if a majority +1 PMC
>> > votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.4.3
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see https://spark.apache.org/
>> >
>> > The tag to be voted on is v3.4.3-rc2 (commit
>> > 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
>> > https://github.com/apache/spark/tree/v3.4.3-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1453/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
>> >
>> > The list of bug fixes going into 3.4.3 can be found at the following URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12353987
>> >
>> > This release is using the release script of the tag v3.4.3-rc2.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks; in the Java/Scala
>> > case you can add the staging repository to your project's resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
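
For the Java/Scala path above, a minimal sketch of what pointing a project at the RC staging repository can look like, shown here with sbt (the resolver URL is the staging repository listed earlier in this thread; the Scala version is the one the 3.4 line is built against by default):

  // build.sbt — sketch for testing the 3.4.3 RC2 staging artifacts
  ThisBuild / scalaVersion := "2.12.17"

  resolvers += "Apache Spark 3.4.3 RC2 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1453/"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "3.4.3" % Provided,
    "org.apache.spark" %% "spark-sql"  % "3.4.3" % Provided
  )

Remember to clear the local artifact cache afterwards so later builds do not keep resolving the RC.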
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.4.3?
>> > ===
>> >
>> > The current list of open tickets targeted at 3.4.3 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> > Version/s" = 3.4.3
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: Versioning of Spark Operator

2024-04-09 Thread DB Tsai
 Aligning with Spark releases is sensible, as it allows us to guarantee that 
the Spark operator functions correctly with the new version while also 
maintaining support for previous versions.
 
DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

> On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan  wrote:
> 
> 
>   I am trying to understand if we can simply align with Spark's version for 
> this?
> It makes the release and JIRA management much simpler for developers and 
> intuitive for users.
> 
> Regards,
> Mridul
> 
> 
> On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  <mailto:dongj...@apache.org>> wrote:
>> Hi, Liang-Chi.
>> 
>> Thank you for leading Apache Spark K8s operator as a shepherd. 
>> 
>> I took a look at `Apache Spark Connect Go` repo mentioned in the thread. 
>> Sadly, there is no release at all and no activity in the last 6 months. It 
>> seems to be the first time for the Apache Spark community to consider these 
>> sister repositories (Go and K8s Operator).
>> 
>> https://github.com/apache/spark-connect-go/commits/master/
>> 
>> Dongjoon.
>> 
>> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
>> > Hi all,
>> > 
>> > We've opened the dedicated repository of Spark Kubernetes Operator,
>> > and the first PR is created.
>> > Thank you for the review from the community so far.
>> > 
>> > About the versioning of the Spark Operator, there are some questions.
>> > 
>> > As we are using the Spark JIRA, when we are going to merge PRs, we need to
>> > choose a Spark version. However, the Spark Operator is versioned
>> > differently than Spark. I'm wondering how we should deal with this?
>> > 
>> > Not sure if Connect also has its versioning different from Spark? If so,
>> > maybe we can follow how Connect does it.
>> > 
>> > Can someone who is familiar with Connect versioning give some suggestions?
>> > 
>> > Thank you.
>> > 
>> > Liang-Chi
>> > 
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>> > <mailto:dev-unsubscr...@spark.apache.org>
>> > 
>> > 
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>> <mailto:dev-unsubscr...@spark.apache.org>
>> 



Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread DB Tsai
+1

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

> On Nov 14, 2023, at 10:14 AM, Vakaris Baškirov  
> wrote:
> 
> +1 (non-binding)
> 
> On Tue, Nov 14, 2023 at 8:03 PM Chao Sun  <mailto:sunc...@apache.org>> wrote:
>> +1
>> 
>> On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh > <mailto:vii...@gmail.com>> wrote:
>> >
>> > +1
>> >
>> > On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou > > <mailto:zhouye...@gmail.com>> wrote:
>> > >
>> > > +1(Non-binding)
>> > >
>> > > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh > > > <mailto:vii...@gmail.com>> wrote:
>> > >>
>> > >> Hi all,
>> > >>
>> > >> I’d like to start a vote for SPIP: An Official Kubernetes Operator for
>> > >> Apache Spark.
>> > >>
>> > >> The proposal is to develop an official Java-based Kubernetes operator
>> > >> for Apache Spark to automate the deployment and simplify the lifecycle
>> > >> management and orchestration of Spark applications and Spark clusters
>> > >> on k8s at prod scale.
>> > >>
>> > >> This aims to reduce the learning curve and operation overhead for
>> > >> Spark users so they can concentrate on core Spark logic.
>> > >>
>> > >> Please also refer to:
>> > >>
>> > >>- Discussion thread:
>> > >> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
>> > >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
>> > >>- SPIP doc: 
>> > >> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>> > >>
>> > >>
>> > >> Please vote on the SPIP for the next 72 hours:
>> > >>
>> > >> [ ] +1: Accept the proposal as an official SPIP
>> > >> [ ] +0
>> > >> [ ] -1: I don’t think this is a good idea because …
>> > >>
>> > >>
>> > >> Thank you!
>> > >>
>> > >> Liang-Chi Hsieh
>> > >>
>> > >> -
>> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>> > >> <mailto:dev-unsubscr...@spark.apache.org>
>> > >>
>> > >
>> > >
>> > > --
>> > >
>> > > Zhou, Ye  周晔
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>> > <mailto:dev-unsubscr...@spark.apache.org>
>> >
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>> <mailto:dev-unsubscr...@spark.apache.org>
>> 



Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread DB Tsai
+1

To be completely transparent, I am employed in the same department as Zhou at 
Apple.

I support this proposal, given that we witnessed community adoption following 
the release of the Flink Kubernetes operator, which streamlined Flink deployment on 
Kubernetes. 

A well-maintained official Spark Kubernetes operator is essential for our Spark 
community as well.

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

> On Nov 9, 2023, at 12:05 PM, Zhou Jiang  wrote:
> 
> Hi Spark community,
> I'm reaching out to initiate a conversation about the possibility of 
> developing a Java-based Kubernetes operator for Apache Spark. Following the 
> operator pattern 
> (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark 
> users may manage applications and related components seamlessly using native 
> tools like kubectl. The primary goal is to simplify the Spark user experience 
> on Kubernetes, minimizing the learning curve and operational complexities and 
> therefore enable users to focus on the Spark application development.
> Although there are several open-source Spark on Kubernetes operators 
> available, none of them are officially integrated into the Apache Spark 
> project. As a result, these operators may lack active support and development 
> for new features. Within this proposal, our aim is to introduce a Java-based 
> Spark operator as an integral component of the Apache Spark project. This 
> solution has been employed internally at Apple for multiple years, operating 
> millions of executors in real production environments. The use of Java in 
> this solution is intended to accommodate a wider user and contributor 
> audience, especially those who are not familiar with Scala.
> Ideally, this operator should have its dedicated repository, similar to Spark 
> Connect Golang or Spark Docker, allowing it to maintain a loose connection 
> with the Spark release cycle. This model is also followed by the Apache Flink 
> Kubernetes operator.
> We believe that this project holds the potential to evolve into a thriving 
> community project over the long run. A comparison can be drawn with the Flink 
> Kubernetes Operator: Apple has open-sourced internal Flink Kubernetes 
> operator, making it a part of the Apache Flink project 
> (https://github.com/apache/flink-kubernetes-operator). This move has gained 
> wide industry adoption and contributions from the community. In a mere year, 
> the Flink operator has garnered more than 600 stars and has attracted 
> contributions from over 80 contributors. This showcases the level of 
> community interest and collaborative momentum that can be achieved in similar 
> scenarios.
> More details can be found at SPIP doc : Spark Kubernetes Operator 
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> Thanks,
> 
> --
> Zhou JIANG
> 



Re: [VOTE][SPIP] Lazy Materialization for Parquet Read Performance Improvement

2023-02-14 Thread DB Tsai
+1

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

> On Feb 14, 2023, at 8:29 AM, Guo Weijie  wrote:
> 
> +1 
> 
> Yuming Wang  wrote on Tue, Feb 14, 2023 at 15:58:
>> +1
>> 
>> On Tue, Feb 14, 2023 at 11:27 AM Prem Sahoo > <mailto:prem.re...@gmail.com>> wrote:
>>> +1
>>> 
>>> On Mon, Feb 13, 2023 at 8:13 PM L. C. Hsieh >> <mailto:vii...@gmail.com>> wrote:
>>>> +1
>>>> 
>>>> On Mon, Feb 13, 2023 at 3:49 PM Mich Talebzadeh >>> <mailto:mich.talebza...@gmail.com>> wrote:
>>>>> +1 for me
>>>>> 
>>>>> 
>>>>>view my Linkedin profile 
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>> 
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>> 
>>>>>  
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>>>> loss, damage or destruction of data or any other property which may arise 
>>>>> from relying on this email's technical content is explicitly disclaimed. 
>>>>> The author will in no case be liable for any monetary damages arising 
>>>>> from such loss, damage or destruction.
>>>>>  
>>>>> 
>>>>> 
>>>>> On Mon, 13 Feb 2023 at 23:18, huaxin gao >>>> <mailto:huaxin.ga...@gmail.com>> wrote:
>>>>>> +1
>>>>>> 
>>>>>> On Mon, Feb 13, 2023 at 3:09 PM Dongjoon Hyun >>>>> <mailto:dongj...@apache.org>> wrote:
>>>>>>> +1
>>>>>>> 
>>>>>>> Dongjoon
>>>>>>> 
>>>>>>> On 2023/02/13 22:52:59 "L. C. Hsieh" wrote:
>>>>>>> > Hi all,
>>>>>>> > 
>>>>>>> > I'd like to start the vote for SPIP: Lazy Materialization for Parquet
>>>>>>> > Read Performance Improvement.
>>>>>>> > 
>>>>>>> > The high summary of the SPIP is that it proposes an improvement to the
>>>>>>> > Parquet reader with lazy materialization which only materializes (i.e.
>>>>>>> > decompress, de-code, etc...) necessary values. For Spark-SQL filter
>>>>>>> > operations, evaluating the filters first and lazily materializing only
>>>>>>> > the used values can save computation wastes and improve the read
>>>>>>> > performance.
>>>>>>> > 
>>>>>>> > References:
>>>>>>> > 
>>>>>>> > JIRA ticket https://issues.apache.org/jira/browse/SPARK-42256
>>>>>>> > SPIP doc 
>>>>>>> > https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>>>>> > Discussion thread
>>>>>>> > https://lists.apache.org/thread/5yf2ylqhcv94y03m7gp3mgf3q0fp6gw6
>>>>>>> > 
>>>>>>> > Please vote on the SPIP for the next 72 hours:
>>>>>>> > 
>>>>>>> > [ ] +1: Accept the proposal as an official SPIP
>>>>>>> > [ ] +0
>>>>>>> > [ ] -1: I don’t think this is a good idea because …
>>>>>>> > 
>>>>>>> > Thank you!
>>>>>>> > 
>>>>>>> > Liang-Chi Hsieh
>>>>>>> > 
>>>>>>> > -
>>>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>>>>>>> > <mailto:dev-unsubscr...@spark.apache.org>
>>>>>>> > 
>>>>>>> > 
>>>>>>> 
>>>>>>> -
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>>>>>>> <mailto:dev-unsubscr...@spark.apache.org>
>>>>>>> 



Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread DB Tsai
+1

Sent from my iPhone

On Jan 31, 2023, at 4:16 PM, Yuming Wang wrote: +1.

On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura wrote:

Great! Much appreciated, Mitch!

Kazu

On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh wrote:

Thanks, Kazu. I followed that template link and, indeed, as you pointed out, it is a common template. If it works, then it is what it is. I will be going through your design proposal, and hopefully we can review it.

Regards,
Mich



   view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  

On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura wrote:

Thank you, Mich. I followed the instructions at https://spark.apache.org/improvement-proposals.html and used its template. While we are open to revising our design doc, it seems more like you are proposing that the community change the instructions per se?

Kazu

On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh wrote:

Hi,

Thanks for these proposals, good suggestions. Is this style of breaking down your approach standard? My view would be that perhaps it makes more sense to follow the industry-established approach of breaking down your technical proposal into:

Background, Objective, Scope, Constraints, Assumptions, Reporting, Deliverables, Timelines, Appendix

Your current approach, using the questions below, may not do justice to your proposal:

Q1. What are you trying to do? Articulate your objectives using absolutely no jargon. What are you trying to achieve?
Q2. What problem is this proposal NOT designed to solve? What issues is the suggested proposal not going to address?
Q3. How is it done today, and what are the limits of current practice?
Q4. What is new in your approach and why do you think it will succeed?
Q5. Who cares? If your proposal succeeds, what tangible benefits will it add?
Q6. What are the risks?
Q7. How long will it take?
Q8. What are the midterm and final "exams" to check for success?

HTH

Mich


   view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  

On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura wrote:

Hi everyone,

I would like to start a discussion on "Lazy Materialization for Parquet Read Performance Improvement".

Chao and I propose a Parquet reader with lazy materialization. For Spark SQL filter operations, evaluating the filters first and lazily materializing only the used values can avoid wasted computation and improve read performance. The current implementation of Spark requires the read values to be materialized (i.e. decompressed, decoded, etc.) into memory first, before applying the filters, even though the filters may eventually throw away many of those values.

We made our design doc as follows:

SPIP JIRA: https://issues.apache.org/jira/browse/SPARK-42256
SPIP Doc: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME

Liang-Chi was kind enough to shepherd this effort. Thank you.

Kazu
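
To make the idea concrete, here is a tiny self-contained sketch (not the actual Parquet reader code) of eager versus lazy materialization: the predicate is evaluated against the filter column first, and the other column is decoded only for the surviving row positions. The column and row-count choices are illustrative.

  object LazyMaterializationSketch {
    // Stand-in for an encoded column: decode() represents the expensive step
    // (decompression, decoding, etc.) in a real Parquet reader.
    final class EncodedColumn[T](values: Array[T]) {
      var decodes = 0
      def decode(i: Int): T = { decodes += 1; values(i) }
    }

    def main(args: Array[String]): Unit = {
      val n = 1000
      def freshColumns() =
        (new EncodedColumn(Array.tabulate(n)(identity)),
         new EncodedColumn(Array.tabulate(n)(i => s"row-$i")))

      // Eager materialization: decode every column for every row, then filter.
      val (idEager, nameEager) = freshColumns()
      val eager = (0 until n)
        .map(i => (idEager.decode(i), nameEager.decode(i)))
        .filter { case (id, _) => id % 100 == 0 }

      // Lazy materialization: evaluate the filter on the id column only,
      // then decode the name column just for the surviving row positions.
      val (idLazy, nameLazy) = freshColumns()
      val positions = (0 until n).filter(i => idLazy.decode(i) % 100 == 0)
      val lazily    = positions.map(i => (idLazy.decode(i), nameLazy.decode(i)))

      assert(eager == lazily)
      println(s"'name' decodes, eager: ${nameEager.decodes}") // 1000
      println(s"'name' decodes, lazy:  ${nameLazy.decodes}")  // 10
    }
  }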




Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread DB Tsai
Thank you, Huaxin for the 3.2.1 release!

Sent from my iPhone

> On Jan 28, 2022, at 5:45 PM, Chao Sun  wrote:
> 
> 
> Thanks Huaxin for driving the release!
> 
>> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng  wrote:
>> It's Great!
>> Congrats and thanks, huaxin!
>> 
>> 
>> -- Original Message --
>> From: "huaxin gao" ;
>> Sent: Saturday, January 29, 2022, 9:07 AM
>> To: "dev"; "user";
>> Subject: [ANNOUNCE] Apache Spark 3.2.1 released
>> 
>> We are happy to announce the availability of Spark 3.2.1!
>> 
>> Spark 3.2.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.2 maintenance branch of Spark. We strongly
>> recommend all 3.2 users to upgrade to this stable release.
>> 
>> To download Spark 3.2.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>> 
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-2-1.html
>> 
>> We would like to acknowledge all community members for contributing to this
>> release. This release would not have been possible without you.
>> 
>> Huaxin Gao


Re: Apache Spark Jenkins Infra 2022

2022-01-09 Thread DB Tsai
Thank you, Dongjoon for driving the build infra.

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

> On Jan 9, 2022, at 6:38 PM, shane knapp ☠  wrote:
> 
> 
> apache spark jenkins lives on!
> 
> @dongjoon, let me know if there's anything you need.  nice work, as always.  
> :)
> 
> shane
> 
> On Sat, Jan 8, 2022 at 7:40 PM Yikun Jiang  <mailto:yikunk...@gmail.com>> wrote:
> @Dongjoon Hyun <mailto:dongjoon.h...@gmail.com> Thanks for your work on 
> “Apache Spark Jenkins Infra 2022”. I think this work is very important and 
> useful for CI jobs that GitHub Actions cannot support yet.
> 
> Regards,
> Yikun
> 
> 
> Dongjoon Hyun  wrote on Sun, Jan 9, 2022 at 07:11:
> Happy New Year!
> 
> After we sunset our legacy Jenkins Infra on December 23rd, 2021,
> there were many missing parts in our test coverage combinations.
> 
> From Today, January 8th, 2022, the following test coverage is recovered
> and newly added as a starter. Although this is a pilot and a small step 
> forward,
> we will continue to build and improve our test coverage for the community.
> 
> ## Maven Test Coverage
> Although Apache Spark supports Maven/SBT build and testing,
> Maven is our official standard for building Apache Spark distributions.
> Since GitHub Action has been covering Maven building only, new Jenkins
> infrastructure recovers Maven building and testing.
> 
> ## Java 17 on Apple Silicon Coverage
> Since there is no publicly available CI option for us to test Apple Silicon 
> machines,
> the new Jenkins infrastructure is running Java/Scala/Python/R testing on 
> Apple Silicon.
> Please note that Java 17 is the first Java release supporting Apple Silicon 
> natively.
> (JEP 391: macOS/AArch64 Port supporting Apple M1-based machine)
> 
> 
> 
> This is maintained by Apache Spark PMC.
> We have more details to be discussed before exposing this infra to the public.
> 
> I want to give my heartfelt thanks to the generous donor and
> ASF Foundation Fundraise Team for making this happen.
> 
> Thanks,
> Dongjoon
> 
> 
> -- 
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu <https://rise.cs.berkeley.edu/>


Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-12 Thread DB Tsai
+1

On Fri, Nov 12, 2021 at 6:34 PM Anton Okolnychyi 
wrote:

> +1 from me too to indicate my commitment (non-binding)
>
> - Anton
>
> > On 12 Nov 2021, at 18:27, Liang Chi Hsieh  wrote:
> >
> > I’d vote my +1 first.
> >
> > On 2021/11/13 02:25:05 "L. C. Hsieh" wrote:
> >> Hi all,
> >>
> >> I’d like to start a vote for SPIP: Row-level operations in Data Source
> V2.
> >>
> >> The proposal is to add support for executing row-level operations
> >> such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
> >> execution should be the same across data sources and the best way to do
> >> that is to implement it in Spark.
> >>
> >> Right now, Spark can only parse and to some extent analyze DELETE,
> UPDATE,
> >> MERGE commands. Data sources that support row-level changes have to
> build
> >> custom Spark extensions to execute such statements. The goal of this
> effort
> >> is to come up with a flexible and easy-to-use API that will work across
> >> data sources.
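
As a concrete illustration of the statements in scope (a sketch only — the table, catalog, and column names are made up, and execution requires a data source that implements the proposed API; today Spark can parse these but delegates execution to custom extensions):

  // assumes an active SparkSession `spark` and a v2 catalog named `catalog`
  spark.sql("DELETE FROM catalog.db.events WHERE event_date < '2020-01-01'")

  spark.sql("UPDATE catalog.db.events SET status = 'archived' WHERE event_date < '2021-01-01'")

  spark.sql("""
    MERGE INTO catalog.db.events AS t
    USING updates AS s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
  """)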
> >>
> >> Please also refer to:
> >>
> >>  - Previous discussion in dev mailing list: [DISCUSS] SPIP:
> >> Row-level operations in Data Source V2
> >>  <https://lists.apache.org/thread/kd8qohrk5h3qx8d6y4lhrm67vnn8p6bv>
> >>
> >>  - JIRA: SPARK-35801 <https://issues.apache.org/jira/browse/SPARK-35801
> >
> >>  - PR for handling DELETE statements:
> >> <https://github.com/apache/spark/pull/33008>
> >>
> >>  - Design doc
> >> <
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
> >
> >>
> >> Please vote on the SPIP for the next 72 hours:
> >>
> >> [ ] +1: Accept the proposal as an official SPIP
> >> [ ] +0
> >> [ ] -1: I don’t think this is a good idea because …
> >>
> >> -----
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >>
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1


Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-29 Thread DB Tsai
+1
DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1


On Fri, Oct 29, 2021 at 11:42 AM Ryan Blue  wrote:

> +1
>
> On Fri, Oct 29, 2021 at 11:06 AM huaxin gao 
> wrote:
>
>> +1
>>
>> On Fri, Oct 29, 2021 at 10:59 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Dongjoon
>>>
>>> On 2021/10/29 17:48:59, Russell Spitzer 
>>> wrote:
>>> > +1 This is a great idea, (I have no Apache Spark voting points)
>>> >
>>> > On Fri, Oct 29, 2021 at 12:41 PM L. C. Hsieh 
>>> wrote:
>>> >
>>> > >
>>> > > I'll start with my +1.
>>> > >
>>> > > On 2021/10/29 17:30:03, L. C. Hsieh  wrote:
>>> > > > Hi all,
>>> > > >
>>> > > > I’d like to start a vote for SPIP: Storage Partitioned Join for
>>> Data
>>> > > Source V2.
>>> > > >
>>> > > > The proposal is to support a new type of join: storage partitioned
>>> join
>>> > > which
>>> > > > covers bucket join support for DataSourceV2 but is more general.
>>> The goal
>>> > > > is to let Spark leverage distribution properties reported by data
>>> > > sources and
>>> > > > eliminate shuffle whenever possible.
>>> > > >
>>> > > > Please also refer to:
>>> > > >
>>> > > >- Previous discussion in dev mailing list: [DISCUSS] SPIP:
>>> Storage
>>> > > Partitioned Join for Data Source V2
>>> > > ><
>>> > >
>>> https://lists.apache.org/thread.html/r7dc67c3db280a8b2e65855cb0b1c86b524d4e6ae1ed9db9ca12cb2e6%40%3Cdev.spark.apache.org%3E
>>> > > >
>>> > > >.
>>> > > >- JIRA: SPARK-37166 <
>>> > > https://issues.apache.org/jira/browse/SPARK-37166>
>>> > > >- Design doc <
>>> > >
>>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>> >
>>> > >
>>> > > >
>>> > > > Please vote on the SPIP for the next 72 hours:
>>> > > >
>>> > > > [ ] +1: Accept the proposal as an official SPIP
>>> > > > [ ] +0
>>> > > > [ ] -1: I don’t think this is a good idea because …
>>> > > >
>>> > > >
>>> -
>>> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > > >
>>> > > >
>>> > >
>>> > > -
>>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >
>>> > >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Ryan Blue
> Tabular
>


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-23 Thread DB Tsai
+1 on this SPIP.

This is a more generalized version of bucketed tables and bucketed
joins, which can eliminate very expensive data shuffles when joining, and
many users in the Apache Spark community have wanted this feature for
a long time!

Thank you, Ryan and Chao, for working on this, and I look forward to
it as a new feature in Spark 3.3

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
>
> Hi,
>
> Ryan and I drafted a design doc to support a new type of join: storage 
> partitioned join which covers bucket join support for DataSourceV2 but is 
> more general. The goal is to let Spark leverage distribution properties 
> reported by data sources and eliminate shuffle whenever possible.
>
> Design doc: 
> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>  (includes a POC link at the end)
>
> We'd like to start a discussion on the doc and any feedback is welcome!
>
> Thanks,
> Chao
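
For readers new to the idea, a small sketch of the existing (v1) bucketed-join pattern that this proposal generalizes to DataSource V2: when both sides report a compatible partitioning on the join keys, the planner can drop the shuffle. The table names and bucket count are illustrative; broadcast joins are disabled here only so the sort-merge join plan is visible.

  import org.apache.spark.sql.SparkSession

  object BucketedJoinSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // force a sort-merge join

      val orders    = (1 to 1000).map(i => (i % 100, i)).toDF("customer_id", "order_id")
      val customers = (1 to 100).map(i => (i, s"customer-$i")).toDF("customer_id", "name")

      // Both tables are bucketed by the join key with the same bucket count.
      orders.write.bucketBy(8, "customer_id").sortBy("customer_id")
        .mode("overwrite").saveAsTable("orders_bucketed")
      customers.write.bucketBy(8, "customer_id").sortBy("customer_id")
        .mode("overwrite").saveAsTable("customers_bucketed")

      // Because both sides already expose a compatible distribution on customer_id,
      // the physical plan needs no Exchange (shuffle) before the join.
      spark.table("orders_bucketed")
        .join(spark.table("customers_bucketed"), "customer_id")
        .explain()
    }
  }

The SPIP extends the same shuffle-elimination idea to distribution properties reported by V2 data sources, rather than only to tables bucketed through the built-in writer.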

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-11 Thread DB Tsai
+1

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

On Mon, Oct 11, 2021 at 6:01 AM Almeida, (Ricardo)
 wrote:
>
> +1 (non-binding)
>
>
>
> Ricardo Almeida
>
>
>
> From: Xiao Li 
> Sent: Monday, October 11, 2021 9:09 AM
> To: Yi Wu 
> Cc: Holden Karau ; Wenchen Fan ; 
> Cheng Pan ; Spark Dev List ; Ye 
> Zhou ; angers zhu 
> Subject: Re: [VOTE] Release Spark 3.2.0 (RC7)
>
>
>
> +1
>
>
>
> Xiao Li
>
>
>
> Yi Wu  wrote on Mon, Oct 11, 2021 at 12:08 AM:
>
> +1 (non-binding)
>
>
>
> On Mon, Oct 11, 2021 at 1:57 PM Holden Karau  wrote:
>
> +1
>
>
>
> On Sun, Oct 10, 2021 at 10:46 PM Wenchen Fan  wrote:
>
> +1
>
>
>
> On Sat, Oct 9, 2021 at 2:36 PM angers zhu  wrote:
>
> +1 (non-binding)
>
>
>
> Cheng Pan  wrote on Sat, Oct 9, 2021 at 2:06 PM:
>
> +1 (non-binding)
>
>
>
> Integration test passed[1] with my project[2].
>
>
>
> [1] https://github.com/housepower/spark-clickhouse-connector/runs/3834335017
>
> [2] https://github.com/housepower/spark-clickhouse-connector
>
>
>
> Thanks,
>
> Cheng Pan
>
>
>
>
>
> On Sat, Oct 9, 2021 at 2:01 PM Ye Zhou  wrote:
>
> +1 (non-binding).
>
>
>
> Run Maven build, tested within our YARN cluster, in client or cluster mode, 
> with push-based shuffle enabled/disabled, and shuffling a large amount of 
> data. Applications ran successfully with expected shuffle behavior.
>
>
>
> On Fri, Oct 8, 2021 at 10:06 PM sarutak  wrote:
>
> +1
>
> I think no critical issue left.
> Thank you Gengliang.
>
> Kousuke
>
> > +1
> >
> > Looks good.
> >
> > Liang-Chi
> >
> > On 2021/10/08 16:16:12, Kent Yao  wrote:
> >> +1 (non-binding)
> >>
> >> Kent Yao
> >> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> >> a spark enthusiast
> >> kyuubi (https://github.com/yaooqinn/kyuubi) is a unified multi-tenant JDBC interface
> >> for large-scale data processing and analytics, built on top of Apache Spark
> >> (http://spark.apache.org/).

Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-31 Thread DB Tsai
Hello Xiao, there are multiple patches in Spark 3.2 depending on parquet
1.12, so it might be easier to wait for the fix in parquet community
instead of reverting all the related changes. The fix in parquet community
is very trivial, and we hope that it will not take too long. Thanks.
DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1


On Tue, Aug 31, 2021 at 1:09 PM Chao Sun  wrote:

> Hi Xiao, I'm still checking with the Parquet community on this. Since the
> fix is already +1'd, I'm hoping this won't take long. The delta in
> parquet-1.12.x branch is also small with just 2 commits so far.
>
> Chao
>
> On Tue, Aug 31, 2021 at 12:03 PM Xiao Li  wrote:
>
>> Hi, Chao,
>>
>> How long will it take? Normally, in the RC stage, we always revert the
>> upgrade made in the current release. We did the parquet upgrade multiple
>> times in the previous releases for avoiding the major delay in our Spark
>> release
>>
>> Thanks,
>>
>> Xiao
>>
>>
>> On Tue, Aug 31, 2021 at 11:03 AM Chao Sun  wrote:
>>
>>> The Apache Parquet community found an issue [1] in 1.12.0 which could
>>> cause incorrect file offset being written and subsequently reading of the
>>> same file to fail. A fix has been proposed in the same JIRA and we may have
>>> to wait until a new release is available so that we can upgrade Spark with
>>> the hot fix.
>>>
>>> [1]: https://issues.apache.org/jira/browse/PARQUET-2078
>>>
>>> On Fri, Aug 27, 2021 at 7:06 AM Sean Owen  wrote:
>>>
>>>> Maybe, I'm just confused why it's needed at all. Other profiles that
>>>> add a dependency seem OK, but something's different here.
>>>>
>>>> One thing we can/should change is to simply remove the
>>>>  block in the profile. It should always be a direct
>>>> dep in Scala 2.13 (which lets us take out the profiles in submodules, which
>>>> just repeat that)
>>>> We can also update the version, by the by.
>>>>
>>>> I tried this and the resulting POM still doesn't look like what I
>>>> expect though.
>>>>
>>>> (The binary release is OK, FWIW - it gets pulled in as a JAR as
>>>> expected)
>>>>
>>>> On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy 
>>>> wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ will
>>>>> help you out here.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Steve C
>>>>>
>>>>> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>>>>>
>>>>> OK right, you would have seen a different error otherwise.
>>>>>
>>>>> Yes profiles are only a compile-time thing, but they should affect the
>>>>> effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
>>>>> scala-parallel-collections as a dependency in the POM as expected (not in 
>>>>> a
>>>>> profile). However I see what you see in the .pom in the release repo, and
>>>>> in my local repo after building - it's just sitting there as a profile as
>>>>> if it weren't activated or something.
>>>>>
>>>>> I'm confused then, that shouldn't be what happens. I'd say maybe there
>>>>> is a problem with the release script, but seems to affect a simple local
>>>>> build. Anyone else more expert in this see the problem, while I try to
>>>>> debug more?
>>>>> The binary distro may actually be fine, I'll check; it may even not
>>>>> matter much for users who generally just treat Spark as a 
>>>>> compile-time-only
>>>>> dependency either. But I can see it would break exactly your case,
>>>>> something like a self-contained test job.
>>>>>
>>>>> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy 
>>>>> wrote:
>>>>>
>>>>>> I did indeed.
>>>>>>
>>>>>> The generated spark-core_2.13-3.2.0.pom that is created alongside the
>>>>>> jar file in the local repo contains:
>>>>>>
>>>>>> 
>>>>>>   scala-2.13
>>>>>>   
>>>>>> 
>>>>>>   org.scala-lang.modules
>>>>>>
>>>>>> scala-parallel-collection

Re: [DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-24 Thread DB Tsai
+1 on renaming.

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

> On Jun 24, 2021, at 11:41 AM, Chao Sun  wrote:
> 
> Hi,
> 
> As Spark master has upgraded to Hadoop-3.3.1, the current Maven profile name 
> hadoop-3.2 is no longer accurate, and it may confuse Spark users when they 
> realize the actual version is not Hadoop 3.2.x. Therefore, I created 
> https://issues.apache.org/jira/browse/SPARK-33880 
> <https://issues.apache.org/jira/browse/SPARK-33880> to change the profile 
> name to hadoop-3 and hadoop-2 respectively. What do you think? Is this 
> something worth doing as part of Spark 3.2.0 release?
> 
> Best,
> Chao



Re: [VOTE] Release Spark 2.4.8 (RC3)

2021-04-28 Thread DB Tsai
+1 (binding)

> On Apr 28, 2021, at 9:26 AM, Liang-Chi Hsieh  wrote:
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.4.8.
>
> The vote is open until May 4th at 9AM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.8
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.8 (try project = SPARK AND
> "Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.8-rc3 (commit
> e89526d2401b3a04719721c923a6f630e555e286):
> https://github.com/apache/spark/tree/v2.4.8-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1377/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc3-docs/
>
> The list of bug fixes going into 2.4.8 can be found at the following URL:
> https://s.apache.org/spark-v2.4.8-rc3
>
> This release is using the release script of the tag v2.4.8-rc3.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.8?
> ===
>
> The current list of open tickets targeted at 2.4.8 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.8
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>





Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-14 Thread DB Tsai
+1 (binding)

DB Tsai  |  ACS Spark Core  |   Apple, Inc.

> On Apr 14, 2021, at 10:42 AM, Wenchen Fan  wrote:
> 
> +1 (binding)
> 
> On Thu, Apr 15, 2021 at 12:22 AM Maxim Gekk  <mailto:maxim.g...@databricks.com>> wrote:
> +1 (non-binding)
> 
> On Wed, Apr 14, 2021 at 6:39 PM Dongjoon Hyun  <mailto:dongjoon.h...@gmail.com>> wrote:
> +1
> 
> Bests,
> Dongjoon.
> 
> On Tue, Apr 13, 2021 at 10:38 PM Kent Yao  <mailto:yaooq...@gmail.com>> wrote:
> +1 (non-binding)
> 
> Kent Yao 
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> a spark enthusiast
> kyuubi <https://github.com/yaooqinn/kyuubi>is a unified multi-tenant JDBC 
> interface for large-scale data processing and analytics, built on top of 
> Apache Spark <http://spark.apache.org/>.
> spark-authorizer <https://github.com/yaooqinn/spark-authorizer>A Spark SQL 
> extension which provides SQL Standard Authorization for Apache Spark 
> <http://spark.apache.org/>.
> spark-postgres <https://github.com/yaooqinn/spark-postgres> A library for 
> reading data from and transferring data to Postgres / Greenplum with Spark 
> SQL and DataFrames, 10~100x faster.
> spark-func-extras <https://github.com/yaooqinn/spark-func-extras>A library 
> that brings excellent and useful functions from various modern database 
> management systems to Apache Spark <http://spark.apache.org/>.
> 
> 
>  
> 
> On 04/14/2021 13:36,Gengliang Wang 
> <mailto:ltn...@gmail.com> wrote:
> +1 (non-binding)
> 
> On Wed, Apr 14, 2021 at 1:34 PM Jungtaek Lim  <mailto:kabhwan.opensou...@gmail.com>> wrote:
> +1 (non-binding)
> 
> signature OK, extracting tgz files OK, build source without running tests OK.
> 
> On Tue, Apr 13, 2021 at 5:02 PM Herman van Hovell  <mailto:her...@databricks.com>> wrote:
> +1
> 
> On Tue, Apr 13, 2021 at 2:40 AM sarutak  <mailto:saru...@oss.nttdata.com>> wrote:
> +1 (non-binding)
> 
> > +1
> > 
> > On Tue, 13 Apr 2021, 02:58 Sean Owen,  > <mailto:sro...@gmail.com>> wrote:
> > 
> >> +1 same result as last RC for me.
> >> 
> >> On Mon, Apr 12, 2021, 12:53 AM Liang-Chi Hsieh  >> <mailto:vii...@gmail.com>>
> >> wrote:
> >> 
> >>> Please vote on releasing the following candidate as Apache Spark
> >>> version
> >>> 2.4.8.
> >>> 
> >>> The vote is open until Apr 15th at 9AM PST and passes if a
> >>> majority +1 PMC
> >>> votes are cast, with a minimum of 3 +1 votes.
> >>> 
> >>> [ ] +1 Release this package as Apache Spark 2.4.8
> >>> [ ] -1 Do not release this package because ...
> >>> 
> >>> To learn more about Apache Spark, please see
> >>> http://spark.apache.org/ <http://spark.apache.org/>
> >>> 
> >>> There are currently no issues targeting 2.4.8 (try project = SPARK
> >>> AND
> >>> "Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In
> >>> Progress"))
> >>> 
> >>> The tag to be voted on is v2.4.8-rc2 (commit
> >>> a0ab27ca6b46b8e5a7ae8bb91e30546082fc551c):
> >>> https://github.com/apache/spark/tree/v2.4.8-rc2 
> >>> <https://github.com/apache/spark/tree/v2.4.8-rc2>
> >>> 
> >>> The release files, including signatures, digests, etc. can be
> >>> found at:
> >>> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-bin/ 
> >>> <https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-bin/>
> >>> 
> >>> Signatures used for Spark RCs can be found in this file:
> >>> https://dist.apache.org/repos/dist/dev/spark/KEYS 
> >>> <https://dist.apache.org/repos/dist/dev/spark/KEYS>
> >>> 
> >>> The staging repository for this release can be found at:
> >>> 
> >> 
> > https://repository.apache.org/content/repositories/orgapachespark-1373/ 
> > <https://repository.apache.org/content/repositories/orgapachespark-1373/>
> >>> 
> >>> The documentation corresponding to this release can be found at:
> >>> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-docs/ 
> >>> <https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-docs/>
> >>> 
> >>> The list of bug fixes going into 2.4.8 can be found at the
> >>> following URL:
> >>> https://s.apache.org/spark-v2.4.8-rc2 
> >>> <https://s

Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread DB Tsai
+1 to add it as an external module so people can test it out and give
feedback easier.

On Mon, Feb 8, 2021 at 10:22 PM Gabor Somogyi  wrote:
>
> +1 adding it any way.
>
> On Mon, 8 Feb 2021, 21:54 Holden Karau,  wrote:
>>
>> +1 for an external module.
>>
>> On Mon, Feb 8, 2021 at 11:51 AM Cheng Su  wrote:
>>>
>>> +1 for (2) adding to external module.
>>>
>>> I think this feature is useful and popular in practice, and option 2 is not 
>>> conflict with previous concern for dependency.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Cheng Su
>>>
>>>
>>>
>>> From: Dongjoon Hyun 
>>> Date: Monday, February 8, 2021 at 10:39 AM
>>> To: Jacek Laskowski 
>>> Cc: Liang-Chi Hsieh , dev 
>>> Subject: Re: [DISCUSS] Add RocksDB StateStore
>>>
>>>
>>>
>>> Thank you, Liang-chi and all.
>>>
>>>
>>>
>>> +1 for (2) external module design because it can deliver the new feature in 
>>> a safe way.
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon
>>>
>>>
>>>
>>> On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski  wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> I'm "okay to add RocksDB StateStore as external module". See no reason not 
>>> to.
>>>
>>>
>>> Pozdrawiam,
>>>
>>> Jacek Laskowski
>>>
>>> 
>>>
>>> https://about.me/JacekLaskowski
>>>
>>> "The Internals Of" Online Books
>>>
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh  wrote:
>>>
>>> Hi devs,
>>>
>>> In Spark structured streaming, we need state store for state management for
>>> stateful operators such streaming aggregates, joins, etc. We have one and
>>> only one state store implementation now. It is an in-memory hashmap which is
>>> backed up to an HDFS-compliant file system at the end of every micro-batch.
>>>
>>> As it basically uses in-memory map to store states, memory consumption is a
>>> serious issue and state store size is limited by the size of the executor
>>> memory. Moreover, state store using more memory means it may impact the
>>> performance of task execution that requires memory too.
>>>
>>> Internally we see more streaming applications that require large state in
>>> stateful operations. For such requirements, we need a StateStore that does not
>>> rely on memory to store states.
>>>
>>> This seems to be also true externally as several other major streaming
>>> frameworks already use RocksDB for state management. RocksDB is an embedded
>>> DB and streaming engines can use it to store state instead of memory
>>> storage.
>>>
>>> So it seems to me that it is a proven choice for large state usage. But
>>> Spark SS still lacks a built-in state store for this requirement.
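
For context on how such a store would plug in: the state store backend is already selectable through a SQL conf, so an external module could ship its provider class and users would opt in with something like the sketch below. The default provider class shown is the existing one; the RocksDB class name is purely illustrative, since no such module existed at the time of this thread.

  import org.apache.spark.sql.SparkSession

  object StateStoreProviderSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("statestore-provider-sketch")
        // Default: in-memory map, checkpointed to an HDFS-compliant file system.
        .config("spark.sql.streaming.stateStore.providerClass",
                "org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider")
        .getOrCreate()

      // Opting in to a RocksDB-backed store would just swap the provider class,
      // e.g. (illustrative class name only):
      // .config("spark.sql.streaming.stateStore.providerClass",
      //         "org.example.state.RocksDBStateStoreProvider")

      println(spark.conf.get("spark.sql.streaming.stateStore.providerClass"))
      spark.stop()
    }
  }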
>>>
>>> Previously there was one attempt SPARK-28120 to add RocksDB StateStore into
>>> Spark SS. IIUC, it was pushed back due to two concerns: extra code
>>> maintenance cost and it introduces RocksDB dependency.
>>>
>>> For the first concern, as more users need the feature, it should
>>> be highly used code in SS and more developers will look at it. For the second
>>> one, we propose (SPARK-34198) to add it as an external module to relieve the
>>> dependency concern.
>>>
>>> Because it was pushed back previously, I'm going to raise this discussion to
>>> know what people think about it now, in advance of submitting any code.
>>>
>>> I think there might be some possible opinions:
>>>
>>> 1. okay to add RocksDB StateStore into sql core module
>>> 2. not okay for 1, but okay to add RocksDB StateStore as external module
>>> 3. either 1 or 2 is okay
>>> 4. not okay to add RocksDB StateStore, no matter into sql core or as
>>> external module
>>>
>>> Please let us know if you have some thoughts.
>>>
>>> Thank you.
>>>
>>> Liang-Chi Hsieh
>>>
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau



-- 
Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread DB Tsai
+1

On Mon, Sep 14, 2020 at 12:30 PM Chandni Singh  wrote:

> +1
>
> Chandni
>
> On Mon, Sep 14, 2020 at 11:41 AM Tom Graves 
> wrote:
>
>> +1
>>
>> Tom
>>
>> On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul Muralidharan <
>> mri...@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based
>> shuffle to improve shuffle efficiency.
>> Please take a look at:
>>
>>- SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
>>- SPIP doc:
>>
>> https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
>>- POC against master and results summary :
>>
>> https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit
>>
>> Active discussions on the jira and SPIP document have settled.
>>
>> I will leave the vote open until Friday (the 18th September 2020), 5pm
>> CST.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>>
>> Thanks,
>> Mridul
>>
>

-- 
Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


Re: [VOTE] Decommissioning SPIP

2020-07-02 Thread DB Tsai
+1

On Thu, Jul 2, 2020 at 8:59 AM Ryan Blue  wrote:

> +1
>
> On Thu, Jul 2, 2020 at 8:00 AM Dongjoon Hyun 
> wrote:
>
>> +1.
>>
>> Thank you, Holden.
>>
>> Bests,
>> Dongjoon.
>>
>> On Thu, Jul 2, 2020 at 6:43 AM wuyi  wrote:
>>
>>> +1 for having this feature in Spark
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -----
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
-- 
Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-13 Thread DB Tsai
For example, JDK11 requires dependency changes which cannot go into 2.4.7. 
Recent development on Kubernetes, such as support for dynamic allocation in Spark 3.0 
on Kubernetes (without a shuffle service), will be hard to land in 2.4.7.

Sent from my iPhone

> On Jun 12, 2020, at 11:50 PM, Reynold Xin  wrote:
> 
> 
> Echoing Sean's earlier comment … What is the functionality that would go into 
> a 2.5.0 release, that can't be in a 2.4.7 release? 
> 
> 
>> On Fri, Jun 12, 2020 at 11:14 PM, Holden Karau  wrote:
>> Can I suggest we maybe decouple this conversation a bit? First, if there is 
>> an agreement in making a transitional release in principle and then folks 
>> who feel strongly about specific backports can have their respective 
>> discussions. It's not like we normally know or have agreement on everything 
>> going into a release at the time we cut the branch.
>> 
>> On Fri, Jun 12, 2020 at 10:28 PM Reynold Xin  wrote:
>> I understand the argument to add JDK 11 support just to extend the EOL, but 
>> the other things seem kind of arbitrary and are not supported by your 
>> arguments, especially DSv2 which is a massive change. DSv2 IIUC is not api 
>> stable yet and will continue to evolve in the 3.x line. 
>> 
>> Spark is designed in a way that’s decoupled from storage, and as a result 
>> one can run multiple versions of Spark in parallel during migration. 
>> At the job level sure, but upgrading large jobs, possibly written in Scala 
>> 2.11, whole-hog as it currently stands is not a small matter. 
>> 
>> On Fri, Jun 12, 2020 at 9:40 PM DB Tsai  wrote:
>> +1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support
>> 
>> We had an internal preview version of Spark 3.0 for our customers to try out 
>> for a while, and then we realized that it's very challenging for enterprise 
>> applications in production to move to Spark 3.0. For example, many of our 
>> customers' Spark applications depend on some internal projects that may not 
>> be owned by ETL teams; it requires much coordination with other teams to 
>> cross-build the dependencies that Spark applications depend on with Scala 
>> 2.12 in order to use Spark 3.0. Now, we removed the support of Scala 2.11 in 
>> Spark 3.0, this results in a really big gap to migrate from 2.x version to 
>> 3.0 based on my observation working with our customers.
>> 
>> Also, JDK8 is already EOL, in some companies, using JDK8 is not supported by 
>> the infra team, and requires an exception to use unsupported JDK. Of course, 
>> for those companies, they can use vendor's Spark distribution such as CDH 
>> Spark 2.4 which supports JDK11 or they can maintain their own Spark release 
>> which is possible but not very trivial.
>> 
>> As a result, having a 2.5 release with DSv2, JDK11, and Scala 2.11 support 
>> can definitely lower the gap, and users can still move forward using new 
>> features. Afterall, the reason why we are working on OSS is we like people 
>> to use our code, isn't it?
>> 
>> Sincerely,
>> 
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>> 
>> 
>> On Fri, Jun 12, 2020 at 8:51 PM Jungtaek Lim  
>> wrote:
>> I guess we already went through the same discussion, right? If anyone 
>> missed it, please go through the discussion thread. [1] The consensus looks to 
>> be not positive to migrate the new DSv2 into Spark 2.x version line, because 
>> the change is pretty much huge, and also backward incompatible.
>> 
>> What I can think of benefits of having Spark 2.5 is to avoid force upgrade 
>> to the major release to have fixes for critical bugs. Not all critical fixes 
>> were landed to 2.x as well, because some fixes bring backward 
>> incompatibility. We don't land these fixes to the 2.x version line because 
>> we didn't consider having Spark 2.5 before - we don't want to let end users 
>> tolerate the inconvenience during upgrading bugfix version. End users may be 
>> OK to tolerate during upgrading minor version, since they can still live 
>> with 2.4.x to deny these fixes.
>> 
>> In addition, given there's a huge time gap between Spark 2.4 and 3.0, we 
>> might want to consider porting some of features which don't bring backward 
>> incompatibility. Well, new major features of Spark 3.0 would be probably 
>> better to be introduced in Spark 3.0, but some features could be, especially 
>> if the feature resolves the long-standing issue or the feature has been 
>

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread DB Tsai
+1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support

We had an internal preview of Spark 3.0 for our customers to try out for a
while, and we realized that it is very challenging for enterprise
applications in production to move to Spark 3.0. For example, many of our
customers' Spark applications depend on internal projects that may not be
owned by the ETL teams; it takes a lot of coordination with other teams to
cross-build those dependencies with Scala 2.12 in order to use Spark 3.0.
Now that we have removed Scala 2.11 support in Spark 3.0, there is a really
big gap when migrating from the 2.x line to 3.0, based on my observation
working with our customers.

Also, JDK8 is already EOL. In some companies, using JDK8 is not supported
by the infra team and requires an exception to run an unsupported JDK. Of
course, those companies can use a vendor's Spark distribution such as CDH
Spark 2.4, which supports JDK11, or they can maintain their own Spark
release, which is possible but not trivial.

As a result, having a 2.5 release with DSv2, JDK11, and Scala 2.11 support
can definitely narrow the gap, and users can still move forward using new
features. After all, the reason we work on OSS is that we want people to
use our code, isn't it?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


On Fri, Jun 12, 2020 at 8:51 PM Jungtaek Lim 
wrote:

> I guess we already went through the same discussion, right? If anyone
> missed it, please go through the discussion thread. [1] The consensus
> there was not in favor of migrating the new DSv2 into the Spark 2.x line,
> because the change is huge and also backward incompatible.
>
> The benefit I can see in having Spark 2.5 is avoiding a forced upgrade
> to the major release just to get fixes for critical bugs. Not all critical
> fixes landed in 2.x, because some of them bring backward incompatibility.
> We didn't land those fixes in the 2.x line because we hadn't considered a
> Spark 2.5 before - we don't want end users to have to tolerate that
> inconvenience when upgrading a bugfix version. End users may be OK
> tolerating it when upgrading a minor version, since they can still stay on
> 2.4.x to avoid these fixes.
>
> In addition, given the huge time gap between Spark 2.4 and 3.0, we
> might want to consider porting some features which don't bring backward
> incompatibility. New major features of Spark 3.0 are probably better
> introduced in Spark 3.0, but some features could be ported, especially if
> a feature resolves a long-standing issue or has been provided for a long
> time in competitive products.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1.
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979
>
> On Sat, Jun 13, 2020 at 10:13 AM Ryan Blue 
> wrote:
>
>> +1 for a 2.x release with a DSv2 API that matches 3.0.
>>
>> There are a lot of big differences between the API in 2.4 and 3.0, and I
>> think a release to help migrate would be beneficial to organizations like
>> ours that will be supporting 2.x and 3.0 in parallel for quite a while.
>> Migration to Spark 3 is going to take time as people build confidence in
>> it. I don't think that can be avoided by leaving a larger feature gap
>> between 2.x and 3.0.
>>
>> On Fri, Jun 12, 2020 at 5:53 PM Xiao Li  wrote:
>>
>>> Based on my understanding, DSV2 is not stable yet. It is still missing
>>> various features. Even our built-in file sources are still unable to
>>> fully migrate to DSV2. We plan to enhance it in the next few releases to
>>> close the gap.
>>>
>>> Also, the changes on DSV2 in Spark 3.0 did not break any existing
>>> application. We should encourage more users to try Spark 3 and increase the
>>> adoption of Spark 3.x.
>>>
>>> Xiao
>>>
>>> On Fri, Jun 12, 2020 at 5:36 PM Holden Karau 
>>> wrote:
>>>
>>>> So, one of the things which we’re planning on backporting internally
>>>> is DSv2, which I think would be more broadly useful if it were available
>>>> in a community release on a 2.x branch. Anything else on top of that would
>>>> be considered case by case, based on whether it makes for an easier upgrade path to 3.
>>>>
>>>> If we’re worried about people using 2.5 as a long term home we could
>>>> always mark it with “-transitional” or something similar?
>>>>
>>>> On Fri, Jun 12, 2020 at 4:33 PM Sean Owen  wrote:
>

Re: [vote] Apache Spark 3.0 RC3

2020-06-08 Thread DB Tsai
+1 (binding)

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Mon, Jun 8, 2020 at 1:03 PM Dongjoon Hyun  wrote:
>
> +1
>
> Thanks,
> Dongjoon.
>
> On Mon, Jun 8, 2020 at 6:37 AM Russell Spitzer  
> wrote:
>>
>> +1 (non-binding) ran the new SCC DSV2 suite and all other tests, no issues
>>
>> On Sun, Jun 7, 2020 at 11:12 PM Yin Huai  wrote:
>>>
>>> Hello everyone,
>>>
>>> I am wondering if it makes more sense to not count Saturday and Sunday. I 
>>> doubt that any serious testing work was done during this past weekend. Can 
>>> we only count business days in the voting process?
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Sun, Jun 7, 2020 at 3:24 PM Denny Lee  wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> On Sun, Jun 7, 2020 at 3:21 PM Jungtaek Lim  
>>>> wrote:
>>>>>
>>>>> I'm seeing the effort to include the correctness issue SPARK-28067 [1] 
>>>>> in 3.0.0 via SPARK-31894 [2]. That doesn't seem to be a regression, so 
>>>>> technically it doesn't block the release. While it'd be good to weigh its 
>>>>> worth (it requires some SS users to discard their state, so requiring it 
>>>>> in a major version upgrade might be less frightening), including 
>>>>> SPARK-28067 in 3.0.0 looks optional.
>>>>>
>>>>> Besides, I see all blockers look to be resolved, thanks all for the 
>>>>> amazing efforts!
>>>>>
>>>>> +1 (non-binding) if the decision of SPARK-28067 is "later".
>>>>>
>>>>> 1. https://issues.apache.org/jira/browse/SPARK-28067
>>>>> 2. https://issues.apache.org/jira/browse/SPARK-31894
>>>>>
>>>>> On Mon, Jun 8, 2020 at 5:23 AM Matei Zaharia  
>>>>> wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> Matei
>>>>>>
>>>>>> On Jun 7, 2020, at 6:53 AM, Maxim Gekk  wrote:
>>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> On Sun, Jun 7, 2020 at 2:34 PM Takeshi Yamamuro  
>>>>>> wrote:
>>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> I don't see any ongoing PR to fix critical bugs in my area.
>>>>>>> Bests,
>>>>>>> Takeshi
>>>>>>>
>>>>>>> On Sun, Jun 7, 2020 at 7:24 PM Mridul Muralidharan  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mridul
>>>>>>>>
>>>>>>>> On Sat, Jun 6, 2020 at 1:20 PM Reynold Xin  wrote:
>>>>>>>>>
>>>>>>>>> Apologies for the mistake. The vote is open till 11:59pm Pacific time 
>>>>>>>>> on Mon June 9th.
>>>>>>>>>
>>>>>>>>> On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin  
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Please vote on releasing the following candidate as Apache Spark 
>>>>>>>>>> version 3.0.0.
>>>>>>>>>>
>>>>>>>>>> The vote is open until [DUE DAY] and passes if a majority +1 PMC 
>>>>>>>>>> votes are cast, with a minimum of 3 +1 votes.
>>>>>>>>>>
>>>>>>>>>> [ ] +1 Release this package as Apache Spark 3.0.0
>>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>>
>>>>>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>>>>>
>>>>>>>>>> The tag to be voted on is v3.0.0-rc3 (commit 
>>>>>>>>>> 3fdfce3120f307147244e5eaf46d61419a723d50):
>>>>>>>>>> https://github.com/apache/spark/tree/v3.0.0-rc3
>>>>>>>>>>
>>>>>>>>>> The release files, including signatures, digests, etc. can be found 
>>>>&g

Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-05-31 Thread DB Tsai
+1 (binding), thanks!

On Sun, May 31, 2020 at 9:23 PM Wenchen Fan  wrote:

> +1 (binding), although I don't know why we jump from RC 3 to RC 8...
>
> On Mon, Jun 1, 2020 at 7:47 AM Holden Karau  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 2.4.6.
>>
>> The vote is open until June 5th at 9AM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.6
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> There are currently no issues targeting 2.4.6 (try project = SPARK AND
>> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
>>
>> The tag to be voted on is v2.4.6-rc8 (commit
>> 807e0a484d1de767d1f02bd8a622da6450bdf940):
>> https://github.com/apache/spark/tree/v2.4.6-rc8
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1349/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-docs/
>>
>> The list of bug fixes going into 2.4.6 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12346781
>>
>> This release is using the release script of the tag v2.4.6-rc8.
>>
>> FAQ
>>
>> =
>> What happened to the other RCs?
>> =
>>
>> The parallel maven build caused some flakiness so I wasn't comfortable
>> releasing them. I backported the fix from the 3.0 branch for this release.
>> I've got a proposed change to the build script so that, going forward, we
>> only push tags once the build succeeds, but it does not block this
>> release.
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.6?
>> ===
>>
>> The current list of open tickets targeted at 2.4.6 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.6
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
- DB Sent from my iPhone


Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread DB Tsai
I am changing my vote from +1 to +0.

Since Spark 3.0 is Scala 2.12 only, having a transitional 2.4.x
release with solid Scala 2.12 support is very important. I would
like to have [SPARK-31399][CORE] Support indylambda Scala closure in
ClosureCleaner backported. Without it, users' code might break when
upgrading from Scala 2.11 to Scala 2.12.
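
For readers unfamiliar with the issue, here is a minimal, hypothetical sketch
(names made up; not the reproduction from SPARK-31399) of the kind of closure
involved: on Scala 2.11 the lambda below compiles to an anonymous inner class,
while on Scala 2.12 it becomes an indylambda via LambdaMetafactory, and
ClosureCleaner must recognize both shapes to work out what the closure really
captures before the task is serialized to executors.

// Hypothetical illustration only; whether a given job hits the bug depends
// on the specifics tracked in SPARK-31399.
import org.apache.spark.sql.SparkSession

class Scaler(factor: Double) extends Serializable {
  // The lambda references `factor`, so it captures the enclosing Scaler
  // instance; that capture is what ClosureCleaner has to analyze.
  def scale(spark: SparkSession): Long =
    spark.sparkContext.parallelize(1 to 100).map(x => x * factor).count()
}

object IndylambdaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("indylambda-demo").getOrCreate()
    try println(new Scaler(2.0).scale(spark))
    finally spark.stop()
  }
}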

Thanks,

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


On Mon, May 18, 2020 at 2:47 PM Holden Karau  wrote:
>
> Another two candidates for backporting that have come up since this RC are 
> SPARK-31692 & SPARK-31399. What are folks' thoughts? Should we roll an RC4?
>
> On Mon, May 18, 2020 at 2:13 PM Sean Owen  wrote:
>>
>> Ah OK, I assumed from the timing that this was cut to include that commit. I 
>> should have looked.
>> Yes, it is not strictly a regression so does not have to block the release 
>> and this can pass. We can release 2.4.7 in a few months, too.
>> How important is the fix? If it's pretty important, it may still be useful 
>> to run one more RC, if it's not too much trouble.
>>
>> On Mon, May 18, 2020 at 11:25 AM Holden Karau  wrote:
>>>
>>> That is correct. I asked on the PR if that was ok with folks before I moved 
>>> forward with the RC and was told that it was ok. I believe that particular 
>>> bug is not a regression and is a long standing issue so we wouldn’t 
>>> normally block the release on it.
>>>
>>> On Mon, May 18, 2020 at 7:40 AM Xiao Li  wrote:
>>>>
>>>> This RC does not include the correctness bug fix 
>>>> https://github.com/apache/spark/commit/a4885f3654899bcb852183af70cc0a82e7dd81d0
>>>>  which is just after RC3 cut.
>>>>
>>>> On Mon, May 18, 2020 at 7:21 AM Tom Graves  
>>>> wrote:
>>>>>
>>>>> +1.
>>>>>
>>>>> Tom
>>>>>
>>>>> On Monday, May 18, 2020, 08:05:24 AM CDT, Wenchen Fan 
>>>>>  wrote:
>>>>>
>>>>>
>>>>> +1, no known blockers.
>>>>>
>>>>> On Mon, May 18, 2020 at 12:49 AM DB Tsai  wrote:
>>>>>
>>>>> +1 as well. Thanks.
>>>>>
>>>>> On Sun, May 17, 2020 at 7:39 AM Sean Owen  wrote:
>>>>>
>>>>> +1 , same response as to the last RC.
>>>>> This looks like it includes the fix discussed last time, as well as a
>>>>> few more small good fixes.
>>>>>
>>>>> On Sat, May 16, 2020 at 12:08 AM Holden Karau  
>>>>> wrote:
>>>>> >
>>>>> > Please vote on releasing the following candidate as Apache Spark 
>>>>> > version 2.4.6.
>>>>> >
>>>>> > The vote is open until May 22nd at 9AM PST and passes if a majority +1 
>>>>> > PMC votes are cast, with a minimum of 3 +1 votes.
>>>>> >
>>>>> > [ ] +1 Release this package as Apache Spark 2.4.6
>>>>> > [ ] -1 Do not release this package because ...
>>>>> >
>>>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>>>> >
>>>>> > There are currently no issues targeting 2.4.6 (try project = SPARK AND 
>>>>> > "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In 
>>>>> > Progress"))
>>>>> >
>>>>> > The tag to be voted on is v2.4.6-rc3 (commit 
>>>>> > 570848da7c48ba0cb827ada997e51677ff672a39):
>>>>> > https://github.com/apache/spark/tree/v2.4.6-rc3
>>>>> >
>>>>> > The release files, including signatures, digests, etc. can be found at:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc3-bin/
>>>>> >
>>>>> > Signatures used for Spark RCs can be found in this file:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>> >
>>>>> > The staging repository for this release can be found at:
>>>>> > https://repository.apache.org/content/repositories/orgapachespark-1344/
>>>>> >
>>>>> > The documentation corresponding to this release can be found at:
>>>>> > https://dist.apache.org/r

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-17 Thread DB Tsai
+1 as well. Thanks.

On Sun, May 17, 2020 at 7:39 AM Sean Owen  wrote:

> +1 , same response as to the last RC.
> This looks like it includes the fix discussed last time, as well as a
> few more small good fixes.
>
> On Sat, May 16, 2020 at 12:08 AM Holden Karau 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.6.
> >
> > The vote is open until May 22nd at 9AM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.6
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > There are currently no issues targeting 2.4.6 (try project = SPARK AND
> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
> >
> > The tag to be voted on is v2.4.6-rc3 (commit
> 570848da7c48ba0cb827ada997e51677ff672a39):
> > https://github.com/apache/spark/tree/v2.4.6-rc3
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc3-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1344/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc3-docs/
> >
> > The list of bug fixes going into 2.4.6 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12346781
> >
> > This release is using the release script of the tag v2.4.6-rc3.
> >
> > FAQ
> >
> > =
> > What happened to RC2?
> > =
> >
> > My computer crashed part of the way through RC2, so I rolled RC3.
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.6?
> > ===
> >
> > The current list of open tickets targeted at 2.4.6 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.6
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
- DB Sent from my iPhone


Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-14 Thread DB Tsai
+1 Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Tue, Jan 14, 2020 at 11:08 AM Sean Owen  wrote:
>
> Yeah it's something about the env I spun up, but I don't know what. It
> happens frequently when I test, but not on Jenkins.
> The Kafka error comes up every now and then and a clean rebuild fixes
> it, but not in my case. I don't know why.
> But if nobody else sees it, I'm pretty sure it's just an artifact of
> the local VM.
>
> On Tue, Jan 14, 2020 at 12:57 PM Dongjoon Hyun  
> wrote:
> >
> > Thank you, Sean.
> >
> > First of all, the `Ubuntu` job on Amplab Jenkins farm is green.
> >
> > 
> > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.4-test-sbt-hadoop-2.7-ubuntu-testing/
> >
> > For the failures,
> >1. Yes, the `HiveExternalCatalogVersionsSuite` flakiness is a known one.
> >2. For the `HDFSMetadataLogSuite` failure, I also observed it a few times 
> > before on CentOS too.
> >3. The Kafka build error is new to me. Does it happen on a clean `Maven` build?
> >
> > Bests,
> > Dongjoon.
> >
> >
> > On Tue, Jan 14, 2020 at 6:40 AM Sean Owen  wrote:
> >>
> >> +1 from me. I checked sigs/licenses, and built/tested from source on
> >> Java 8 + Ubuntu 18.04 with " -Pyarn -Phive -Phive-thriftserver
> >> -Phadoop-2.7 -Pmesos -Pkubernetes -Psparkr -Pkinesis-asl". I do get
> >> test failures, but, these are some I have always seen on Ubuntu, and I
> >> do not know why they happen. They don't seem to affect others, but,
> >> let me know if anyone else sees these?
> >>
> >>
> >> Always happens for me:
> >>
> >> - HDFSMetadataLog: metadata directory collision *** FAILED ***
> >>   The await method on Waiter timed out. (HDFSMetadataLogSuite.scala:178)
> >>
> >> This one has been flaky at times due to external dependencies:
> >>
> >> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
> >>   Exception encountered when invoking run on a nested suite -
> >> spark-submit returned with exit code 1.
> >>   Command line: './bin/spark-submit' '--name' 'prepare testing tables'
> >> '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf'
> >> 'spark.master.rest.enabled=false' '--conf'
> >> 'spark.sql.warehouse.dir=/data/spark-2.4.5/sql/hive/target/tmp/warehouse-c2f762fd-688e-42b7-a822-06823a6bbd98'
> >> '--conf' 'spark.sql.test.version.index=0' '--driver-java-options'
> >> '-Dderby.system.home=/data/spark-2.4.5/sql/hive/target/tmp/warehouse-c2f762fd-688e-42b7-a822-06823a6bbd98'
> >> '/data/spark-2.4.5/sql/hive/target/tmp/test7297526474581770293.py'
> >>
> >> Kafka doesn't build with this weird error. I tried a clean build. I
> >> think we've seen this before.
> >>
> >> [error] This symbol is required by 'method
> >> org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
> >> [error] Make sure that term eclipse is in your classpath and check for
> >> conflicting dependencies with `-Ylog-classpath`.
> >> [error] A full rebuild may help if 'MetricsSystem.class' was compiled
> >> against an incompatible version of org.
> >> [error] testUtils.sendMessages(topic, data.toArray)
> >> [error]
> >>
> >> On Mon, Jan 13, 2020 at 6:28 AM Dongjoon Hyun  
> >> wrote:
> >> >
> >> > Please vote on releasing the following candidate as Apache Spark version 
> >> > 2.4.5.
> >> >
> >> > The vote is open until January 16th 5AM PST and passes if a majority +1 
> >> > PMC votes are cast, with a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 2.4.5
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see http://spark.apache.org/
> >> >
> >> > The tag to be voted on is v2.4.5-rc1 (commit 
> >> > 33bd2beee5e3772a9af1d782f195e6a678c54cf0):
> >> > https://github.com/apache/spark/tree/v2.4.5-rc1
> >> >
> >> > The release files, including signatures, digests, etc. can be found at:
> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.5-rc1-b

Re: [VOTE] Release Apache Spark 2.4.4 (RC3)

2019-08-28 Thread DB Tsai
+1

Thanks!

On Wed, Aug 28, 2019 at 7:14 AM Wenchen Fan  wrote:

> +1, no more blocking issues that I'm aware of.
>
> On Wed, Aug 28, 2019 at 8:33 PM Sean Owen  wrote:
>
>> +1 from me again.
>>
>> On Tue, Aug 27, 2019 at 6:06 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.4.
>> >
>> > The vote is open until August 30th 5PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.4.4
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.4.4-rc3 (commit
>> 7955b3962ac46b89564e0613db7bea98a1478bf2):
>> > https://github.com/apache/spark/tree/v2.4.4-rc3
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1332/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-docs/
>> >
>> > The list of bug fixes going into 2.4.4 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12345466
>> >
>> > This release is using the release script of the tag v2.4.4-rc3.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with a out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.4.4?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.4.4 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.4
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
- DB Sent from my iPhone


[DISCUSSION]JDK11 for Apache 2.x?

2019-08-27 Thread DB Tsai
Hello everyone,

Thank you all for working on supporting JDK11 in Apache Spark 3.0 as a 
community.

Java 8 has already reached end of life for commercial users, and many
companies are moving to Java 11. Apache Spark 3.0 has not been released yet,
and there are many API incompatibility issues when upgrading from Spark 2.x.
As a result, asking users to move to Spark 3.0 just to use JDK 11 is not
realistic.

Should we backport the JDK11 PRs and cut a release in the 2.x line to
support JDK11?

Should we instead cut a new Apache Spark 2.5, since the patches involve some
dependency changes that are not desirable in a minor release?

Thanks.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-27 Thread DB Tsai
+1

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Tue, Aug 27, 2019 at 11:31 AM Dongjoon Hyun  wrote:
>
> +1.
>
> I also verified SHA/GPG and tested UTs on AdoptOpenJDKu8_222/CentOS6.9 with 
> profile
> "-Pyarn -Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver"
>
> Additionally, JDBC IT also is tested.
>
> Thank you, Kazuaki!
>
> Bests,
> Dongjoon.
>
>
> On Tue, Aug 27, 2019 at 11:20 AM Sean Owen  wrote:
>>
>> +1 - license and signature looks OK, the docs look OK, the artifacts
>> seem to be in order. Tests passed for me when building from source
>> with most common profiles set.
>>
>> On Mon, Aug 26, 2019 at 3:28 PM Kazuaki Ishizaki  wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark version 
>> > 2.3.4.
>> >
>> > The vote is open until August 29th 2PM PST and passes if a majority +1 PMC 
>> > votes are cast, with
>> > a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.3.4
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see https://spark.apache.org/
>> >
>> > The tag to be voted on is v2.3.4-rc1 (commit 
>> > 8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210):
>> > https://github.com/apache/spark/tree/v2.3.4-rc1
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1331/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-docs/
>> >
>> > The list of bug fixes going into 2.3.4 can be found at the following URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12344844
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with a out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.3.4?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.3.4 can be found at:
>> > https://issues.apache.org/jira/projects/SPARKand search for "Target 
>> > Version/s" = 2.3.4
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: JDK11 Support in Apache Spark

2019-08-24 Thread DB Tsai
Congratulations on the great work!

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Sat, Aug 24, 2019 at 8:11 AM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Thanks to your many many contributions,
> Apache Spark master branch starts to pass on JDK11 as of today.
> (with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)
>
> 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
> (JDK11 is used for building and testing.)
>
> We already verified all UTs (including PySpark/SparkR) before.
>
> Please feel free to use JDK11 in order to build/test/run `master` branch and
> share your experience including any issues. It will help Apache Spark 3.0.0 
> release.
>
> For the follow-ups, please follow 
> https://issues.apache.org/jira/browse/SPARK-24417 .
> The next step is `how to support JDK8/JDK11 together in a single artifact`.
>
> Bests,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Release Apache Spark 2.4.4

2019-08-13 Thread DB Tsai
+1

On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Spark 2.4.3 was released three months ago (8th May).
> As of today (13th August), there are 112 commits (75 JIRAs) in `branch-24` 
> since 2.4.3.
>
> It would be great if we can have Spark 2.4.4.
> Shall we start `2.4.4 RC1` next Monday (19th August)?
>
> Last time, there was a request for K8s issue and now I'm waiting for 
> SPARK-27900.
> Please let me know if there is another issue.
>
> Thanks,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-24 Thread DB Tsai
+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use-cases. But I saw a good performance gain when I converted data
from rows to columns to leverage SIMD architectures in a POC ML
application.

With the exposed columnar processing support, I can imagine that the
heavy-lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
leverage SIMD architectures to get a good speedup.
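
To make the SIMD point concrete, below is a small, self-contained Scala
sketch. It deliberately does not use Spark's columnar APIs; the class and
method names are made up for illustration. It contrasts a row-oriented layout
with a column-oriented one for a simple objective-function-like reduction:
the tight loop over primitive arrays is the shape the JVM's JIT can
auto-vectorize with SIMD instructions.

// Illustrative only; not Spark's ColumnarBatch/ColumnVector API.
object ColumnarSimdSketch {

  // Row-oriented layout: every access dereferences an object.
  final case class Row(label: Double, weight: Double)

  def rowOrientedDot(rows: Array[Row]): Double = {
    var acc = 0.0
    var i = 0
    while (i < rows.length) {
      acc += rows(i).label * rows(i).weight
      i += 1
    }
    acc
  }

  // Column-oriented layout: two dense primitive arrays, SIMD-friendly.
  def columnOrientedDot(labels: Array[Double], weights: Array[Double]): Double = {
    var acc = 0.0
    var i = 0
    while (i < labels.length) {
      acc += labels(i) * weights(i)
      i += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val n = 1 << 20
    val labels  = Array.tabulate(n)(i => (i % 7).toDouble)
    val weights = Array.tabulate(n)(i => 1.0 / (i + 1))
    val rows    = Array.tabulate(n)(i => Row(labels(i), weights(i)))
    println(rowOrientedDot(rows))
    println(columnOrientedDot(labels, weights))
  }
}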

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
>
> It would allow the columnar processing to be extended through the 
> shuffle.  So if I were doing, say, an FPGA-accelerated extension, it could 
> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as 
> input instead of a Row. The extended version of the ShuffleExchangeExec could 
> then do the partitioning on the incoming batch and, instead of producing a 
> ShuffleRowRDD for the exchange, it could produce something like a 
> ShuffleBatchRDD that would let the serializing and deserializing happen in a 
> column-based format for a faster exchange, assuming that columnar processing 
> is also happening after the exchange. This is just like providing a columnar 
> version of any other Catalyst operator, except in this case it is a bit more 
> complex of an operator.
>
> On Wed, May 15, 2019 at 12:15 PM Imran Rashid  
> wrote:
>>
>> sorry I am late to the discussion here -- the jira mentions using this 
>> extensions for dealing with shuffles, can you explain that part?  I don't 
>> see how you would use this to change shuffle behavior at all.
>>
>> On Tue, May 14, 2019 at 10:59 AM Thomas graves  wrote:
>>>
>>> Thanks for replying, I'll extend the vote til May 26th to allow your
>>> and other people feedback who haven't had time to look at it.
>>>
>>> Tom
>>>
>>> On Mon, May 13, 2019 at 4:43 PM Holden Karau  wrote:
>>> >
>>> > I’d like to ask this vote period to be extended, I’m interested but I 
>>> > don’t have the cycles to review it in detail and make an informed vote 
>>> > until the 25th.
>>> >
>>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng  wrote:
>>> >>
>>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't 
>>> >> feel strongly about it. I would still suggest doing the following:
>>> >>
>>> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.
>>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick 
>>> >> check. Beside ColumnarBatch and ColumnarVector, we also need to make the 
>>> >> following public. People who are familiar with SQL internals should help 
>>> >> assess the risk.
>>> >> * ColumnarArray
>>> >> * ColumnarMap
>>> >> * unsafe.types.CalendarInterval
>>> >> * ColumnarRow
>>> >> * UTF8String
>>> >> * ArrayData
>>> >> * ...
>>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't match 
>>> >> the purpose of this SPIP. It does make some code cleaner. But I guess 
>>> >> for ETL use cases, it won't bring much value.
>>> >>
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.): 
>>> > https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-04 Thread DB Tsai
+user list

We are happy to announce the availability of Spark 2.4.1!

Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4
maintenance branch of Spark. We strongly recommend all 2.4.0 users to
upgrade to this stable release.

In Apache Spark 2.4.1, Scala 2.12 support is GA, and it's no longer
experimental.
We will drop Scala 2.11 support in Spark 3.0, so please provide us feedback.

To download Spark 2.4.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-4-1.html

One more thing: to add a little color to this release, it's the
largest RC ever (RC9)!
We tried to incorporate many critical fixes at the last minute, and
hope you all enjoy it.

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-04 Thread DB Tsai
We are happy to announce the availability of Spark 2.4.1!

Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4
maintenance branch of Spark. We strongly recommend all 2.4.0 users to
upgrade to this stable release.

In Apache Spark 2.4.1, Scala 2.12 support is GA, and it's no longer 
experimental.
We will drop Scala 2.11 support in Spark 3.0, so please provide us feedback.

To download Spark 2.4.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-4-1.html

One more thing: to add a little color to this release, it's the largest RC ever 
(RC9)!
We tried to incorporate many critical fixes at the last minute, and hope you 
all enjoy it.

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-31 Thread DB Tsai
This vote passes!

+1:
Wenchen Fan (binding)
Sean Owen (binding)
Mihaly Toth
DB Tsai (binding)
Jonatan Jäderberg
Xiao Li (binding)
Denny Lee
Felix Cheung (binding)

+0: None

-1: None

It's the largest RC ever; I will follow up with an official release
announcement soon.

Thank you all for your patience and for participating in the Apache Spark 2.4.1 release.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Fri, Mar 29, 2019 at 10:20 PM Felix Cheung  wrote:
>
> +1
>
> build source
> R tests
> R package CRAN check locally, r-hub
>
>
> 
> From: d_t...@apple.com on behalf of DB Tsai 
> Sent: Wednesday, March 27, 2019 11:31 AM
> To: dev
> Subject: [VOTE] Release Apache Spark 2.4.1 (RC9)
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.1.
>
> The vote is open until March 30 PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.1-rc9 (commit 
> 58301018003931454e93d8a309c7149cf84c279e):
> https://github.com/apache/spark/tree/v2.4.1-rc9
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1319/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-docs/
>
> The list of bug fixes going into 2.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.1?
> ===
>
> The current list of open tickets targeted at 2.4.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-28 Thread DB Tsai
+1 from myself

On Thu, Mar 28, 2019 at 3:14 AM Mihaly Toth 
wrote:

> +1 (non-binding)
>
> Thanks, Misi
>
> Sean Owen  ezt írta (időpont: 2019. márc. 28., Cs,
> 0:19):
>
>> +1 from me - same as last time.
>>
>> On Wed, Mar 27, 2019 at 1:31 PM DB Tsai  wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.1.
>> >
>> > The vote is open until March 30 PST and passes if a majority +1 PMC
>> votes are cast, with
>> > a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.4.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.4.1-rc9 (commit
>> 58301018003931454e93d8a309c7149cf84c279e):
>> > https://github.com/apache/spark/tree/v2.4.1-rc9
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1319/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-docs/
>> >
>> > The list of bug fixes going into 2.4.1 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with a out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.4.1?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.4.1 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.1
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> >
>> > DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
>> Apple, Inc
>> >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
- DB Sent from my iPhone


[VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-27 Thread DB Tsai
Please vote on releasing the following candidate as Apache Spark version 2.4.1.

The vote is open until March 30 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.1-rc9 (commit 
58301018003931454e93d8a309c7149cf84c279e):
https://github.com/apache/spark/tree/v2.4.1-rc9

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1319/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-docs/

The list of bug fixes going into 2.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/2.4.1

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
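
As an illustration of the Java/Scala path, a minimal sbt sketch for pulling
the RC from the staging repository listed above might look like the
following; the module list and Scala version are only examples, so adjust
them to your project:

// build.sbt - sketch for testing the 2.4.1 RC9 staging artifacts.
// The resolver URL is the staging repository from this vote thread.
scalaVersion := "2.11.12"

resolvers += ("Apache Spark 2.4.1 RC9 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1319/")

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.1",
  "org.apache.spark" %% "spark-sql"  % "2.4.1"
)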

===
What should happen to JIRA tickets still targeting 2.4.1?
===

The current list of open tickets targeted at 2.4.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread DB Tsai
RC9 was just cut. Will send out another thread once the build is finished.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Mon, Mar 25, 2019 at 5:10 PM Sean Owen  wrote:
>
> That's all merged now. I think you're clear to start an RC.
>
> On Mon, Mar 25, 2019 at 4:06 PM DB Tsai  wrote:
> >
> > I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961
> > https://github.com/apache/spark/pull/24126 , anything critical that we
> > have to wait for 2.4.1 release? Thanks!
> >
> > Sincerely,
> >
> > DB Tsai
> > --
> > Web: https://www.dbtsai.com
> > PGP Key ID: 42E5B25A8F7A82C1
> >
> > On Sun, Mar 24, 2019 at 8:19 PM Sean Owen  wrote:
> > >
> > > Still waiting on a successful test - hope this one works.
> > >
> > > On Sun, Mar 24, 2019, 10:13 PM DB Tsai  wrote:
> > >>
> > >> Hello Sean,
> > >>
> > >> By looking at SPARK-26961 PR, seems it's ready to go. Do you think we
> > >> can merge it into 2.4 branch soon?
> > >>
> > >> Sincerely,
> > >>
> > >> DB Tsai
> > >> --
> > >> Web: https://www.dbtsai.com
> > >> PGP Key ID: 42E5B25A8F7A82C1
> > >>
> > >> On Sat, Mar 23, 2019 at 12:04 PM Sean Owen  wrote:
> > >> >
> > >> > I think we can/should get in SPARK-26961 too; it's all but ready to 
> > >> > commit.
> > >> >
> > >> > On Sat, Mar 23, 2019 at 2:02 PM DB Tsai  wrote:
> > >> > >
> > >> > > -1
> > >> > >
> > >> > > I will fail RC8, and cut another RC9 on Monday to include 
> > >> > > SPARK-27160,
> > >> > > SPARK-27178, SPARK-27112. Please let me know if there is any critical
> > >> > > PR that has to be back-ported into branch-2.4.
> > >> > >
> > >> > > Thanks.
> > >> > >
> > >> > > Sincerely,
> > >> > >
> > >> > > DB Tsai
> > >> > > --
> > >> > > Web: https://www.dbtsai.com
> > >> > > PGP Key ID: 42E5B25A8F7A82C1
> > >> > >
> > >> > > On Fri, Mar 22, 2019 at 12:28 AM DB Tsai  wrote:
> > >> > > >
> > >> > > > Since we have couple concerns and hesitations to release rc8, how
> > >> > > > about we give it couple days, and have another vote on March 25,
> > >> > > > Monday? In this case, I will cut another rc9 in the Monday morning.
> > >> > > >
> > >> > > > Darcy, as Dongjoon mentioned,
> > >> > > > https://github.com/apache/spark/pull/24092 is conflict against
> > >> > > > branch-2.4, can you make anther PR against branch-2.4 so we can
> > >> > > > include the ORC fix in 2.4.1?
> > >> > > >
> > >> > > > Thanks.
> > >> > > >
> > >> > > > Sincerely,
> > >> > > >
> > >> > > > DB Tsai
> > >> > > > --
> > >> > > > Web: https://www.dbtsai.com
> > >> > > > PGP Key ID: 42E5B25A8F7A82C1
> > >> > > >
> > >> > > > On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung 
> > >> > > >  wrote:
> > >> > > > >
> > >> > > > > Reposting for shane here
> > >> > > > >
> > >> > > > > [SPARK-27178]
> > >> > > > > https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
> > >> > > > >
> > >> > > > > Is problematic too and it’s not in the rc8 cut
> > >> > > > >
> > >> > > > > https://github.com/apache/spark/commits/branch-2.4
> > >> > > > >
> > >> > > > > (Personally I don’t want to delay 2.4.1 either..)
> > >> > > > >
> > >> > > > > 
> > >> > > > > From: Sean Owen 
> > >> > > > > Sent: W

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread DB Tsai
I am going to cut a 2.4.1 RC9 tonight. Besides SPARK-26961
(https://github.com/apache/spark/pull/24126), is there anything critical
that we need to wait for before the 2.4.1 release? Thanks!

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Sun, Mar 24, 2019 at 8:19 PM Sean Owen  wrote:
>
> Still waiting on a successful test - hope this one works.
>
> On Sun, Mar 24, 2019, 10:13 PM DB Tsai  wrote:
>>
>> Hello Sean,
>>
>> By looking at SPARK-26961 PR, seems it's ready to go. Do you think we
>> can merge it into 2.4 branch soon?
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>> On Sat, Mar 23, 2019 at 12:04 PM Sean Owen  wrote:
>> >
>> > I think we can/should get in SPARK-26961 too; it's all but ready to commit.
>> >
>> > On Sat, Mar 23, 2019 at 2:02 PM DB Tsai  wrote:
>> > >
>> > > -1
>> > >
>> > > I will fail RC8, and cut another RC9 on Monday to include SPARK-27160,
>> > > SPARK-27178, SPARK-27112. Please let me know if there is any critical
>> > > PR that has to be back-ported into branch-2.4.
>> > >
>> > > Thanks.
>> > >
>> > > Sincerely,
>> > >
>> > > DB Tsai
>> > > --
>> > > Web: https://www.dbtsai.com
>> > > PGP Key ID: 42E5B25A8F7A82C1
>> > >
>> > > On Fri, Mar 22, 2019 at 12:28 AM DB Tsai  wrote:
>> > > >
>> > > > Since we have couple concerns and hesitations to release rc8, how
>> > > > about we give it couple days, and have another vote on March 25,
>> > > > Monday? In this case, I will cut another rc9 in the Monday morning.
>> > > >
>> > > > Darcy, as Dongjoon mentioned,
>> > > > https://github.com/apache/spark/pull/24092 is conflict against
>> > > > branch-2.4, can you make anther PR against branch-2.4 so we can
>> > > > include the ORC fix in 2.4.1?
>> > > >
>> > > > Thanks.
>> > > >
>> > > > Sincerely,
>> > > >
>> > > > DB Tsai
>> > > > --
>> > > > Web: https://www.dbtsai.com
>> > > > PGP Key ID: 42E5B25A8F7A82C1
>> > > >
>> > > > On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung 
>> > > >  wrote:
>> > > > >
>> > > > > Reposting for shane here
>> > > > >
>> > > > > [SPARK-27178]
>> > > > > https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
>> > > > >
>> > > > > Is problematic too and it’s not in the rc8 cut
>> > > > >
>> > > > > https://github.com/apache/spark/commits/branch-2.4
>> > > > >
>> > > > > (Personally I don’t want to delay 2.4.1 either..)
>> > > > >
>> > > > > 
>> > > > > From: Sean Owen 
>> > > > > Sent: Wednesday, March 20, 2019 11:18 AM
>> > > > > To: DB Tsai
>> > > > > Cc: dev
>> > > > > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC8)
>> > > > >
>> > > > > +1 for this RC. The tag is correct, licenses and sigs check out, 
>> > > > > tests
>> > > > > of the source with most profiles enabled works for me.
>> > > > >
>> > > > > On Tue, Mar 19, 2019 at 5:28 PM DB Tsai  
>> > > > > wrote:
>> > > > > >
>> > > > > > Please vote on releasing the following candidate as Apache Spark 
>> > > > > > version 2.4.1.
>> > > > > >
>> > > > > > The vote is open until March 23 PST and passes if a majority +1 
>> > > > > > PMC votes are cast, with
>> > > > > > a minimum of 3 +1 votes.
>> > > > > >
>> > > > > > [ ] +1 Release this package as Apache Spark 2.4.1
>> > > > > > [ ] -1 Do not release this package because ...
>> > > > > >
>> > > > > > To learn more about Apache Spark, please see 
>> > > 

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-24 Thread DB Tsai
Hello Sean,

Looking at the SPARK-26961 PR, it seems ready to go. Do you think we
can merge it into the 2.4 branch soon?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Sat, Mar 23, 2019 at 12:04 PM Sean Owen  wrote:
>
> I think we can/should get in SPARK-26961 too; it's all but ready to commit.
>
> On Sat, Mar 23, 2019 at 2:02 PM DB Tsai  wrote:
> >
> > -1
> >
> > I will fail RC8, and cut another RC9 on Monday to include SPARK-27160,
> > SPARK-27178, SPARK-27112. Please let me know if there is any critical
> > PR that has to be back-ported into branch-2.4.
> >
> > Thanks.
> >
> > Sincerely,
> >
> > DB Tsai
> > --
> > Web: https://www.dbtsai.com
> > PGP Key ID: 42E5B25A8F7A82C1
> >
> > On Fri, Mar 22, 2019 at 12:28 AM DB Tsai  wrote:
> > >
> > > Since we have couple concerns and hesitations to release rc8, how
> > > about we give it couple days, and have another vote on March 25,
> > > Monday? In this case, I will cut another rc9 in the Monday morning.
> > >
> > > Darcy, as Dongjoon mentioned,
> > > https://github.com/apache/spark/pull/24092 is conflict against
> > > branch-2.4, can you make anther PR against branch-2.4 so we can
> > > include the ORC fix in 2.4.1?
> > >
> > > Thanks.
> > >
> > > Sincerely,
> > >
> > > DB Tsai
> > > --
> > > Web: https://www.dbtsai.com
> > > PGP Key ID: 42E5B25A8F7A82C1
> > >
> > > On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung  
> > > wrote:
> > > >
> > > > Reposting for shane here
> > > >
> > > > [SPARK-27178]
> > > > https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
> > > >
> > > > Is problematic too and it’s not in the rc8 cut
> > > >
> > > > https://github.com/apache/spark/commits/branch-2.4
> > > >
> > > > (Personally I don’t want to delay 2.4.1 either..)
> > > >
> > > > 
> > > > From: Sean Owen 
> > > > Sent: Wednesday, March 20, 2019 11:18 AM
> > > > To: DB Tsai
> > > > Cc: dev
> > > > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC8)
> > > >
> > > > +1 for this RC. The tag is correct, licenses and sigs check out, tests
> > > > of the source with most profiles enabled works for me.
> > > >
> > > > On Tue, Mar 19, 2019 at 5:28 PM DB Tsai  
> > > > wrote:
> > > > >
> > > > > Please vote on releasing the following candidate as Apache Spark 
> > > > > version 2.4.1.
> > > > >
> > > > > The vote is open until March 23 PST and passes if a majority +1 PMC 
> > > > > votes are cast, with
> > > > > a minimum of 3 +1 votes.
> > > > >
> > > > > [ ] +1 Release this package as Apache Spark 2.4.1
> > > > > [ ] -1 Do not release this package because ...
> > > > >
> > > > > To learn more about Apache Spark, please see http://spark.apache.org/
> > > > >
> > > > > The tag to be voted on is v2.4.1-rc8 (commit 
> > > > > 746b3ddee6f7ad3464e326228ea226f5b1f39a41):
> > > > > https://github.com/apache/spark/tree/v2.4.1-rc8
> > > > >
> > > > > The release files, including signatures, digests, etc. can be found 
> > > > > at:
> > > > > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-bin/
> > > > >
> > > > > Signatures used for Spark RCs can be found in this file:
> > > > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > > > >
> > > > > The staging repository for this release can be found at:
> > > > > https://repository.apache.org/content/repositories/orgapachespark-1318/
> > > > >
> > > > > The documentation corresponding to this release can be found at:
> > > > > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-docs/
> > > > >
> > > > > The list of bug fixes going into 2.4.1 can be found at the following 
> > > > > URL:
> > > > > https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
> > > > >
> > > 

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-23 Thread DB Tsai
-1

I will fail RC8, and cut another RC9 on Monday to include SPARK-27160,
SPARK-27178, SPARK-27112. Please let me know if there is any critical
PR that has to be back-ported into branch-2.4.

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Fri, Mar 22, 2019 at 12:28 AM DB Tsai  wrote:
>
> Since we have a couple of concerns and hesitations about releasing rc8, how
> about we give it a couple of days and have another vote on Monday, March 25?
> In that case, I will cut rc9 on Monday morning.
>
> Darcy, as Dongjoon mentioned,
> https://github.com/apache/spark/pull/24092 conflicts with
> branch-2.4. Can you open another PR against branch-2.4 so we can
> include the ORC fix in 2.4.1?
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung  
> wrote:
> >
> > Reposting for shane here
> >
> > [SPARK-27178]
> > https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
> >
> > Is problematic too and it’s not in the rc8 cut
> >
> > https://github.com/apache/spark/commits/branch-2.4
> >
> > (Personally I don’t want to delay 2.4.1 either..)
> >
> > 
> > From: Sean Owen 
> > Sent: Wednesday, March 20, 2019 11:18 AM
> > To: DB Tsai
> > Cc: dev
> > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC8)
> >
> > +1 for this RC. The tag is correct, licenses and sigs check out, tests
> > of the source with most profiles enabled works for me.
> >
> > On Tue, Mar 19, 2019 at 5:28 PM DB Tsai  wrote:
> > >
> > > Please vote on releasing the following candidate as Apache Spark version 
> > > 2.4.1.
> > >
> > > The vote is open until March 23 PST and passes if a majority +1 PMC votes 
> > > are cast, with
> > > a minimum of 3 +1 votes.
> > >
> > > [ ] +1 Release this package as Apache Spark 2.4.1
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see http://spark.apache.org/
> > >
> > > The tag to be voted on is v2.4.1-rc8 (commit 
> > > 746b3ddee6f7ad3464e326228ea226f5b1f39a41):
> > > https://github.com/apache/spark/tree/v2.4.1-rc8
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-bin/
> > >
> > > Signatures used for Spark RCs can be found in this file:
> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >
> > > The staging repository for this release can be found at:
> > > https://repository.apache.org/content/repositories/orgapachespark-1318/
> > >
> > > The documentation corresponding to this release can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-docs/
> > >
> > > The list of bug fixes going into 2.4.1 can be found at the following URL:
> > > https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
> > >
> > > FAQ
> > >
> > > =
> > > How can I help test this release?
> > > =
> > >
> > > If you are a Spark user, you can help us test this release by taking
> > > an existing Spark workload and running on this release candidate, then
> > > reporting any regressions.
> > >
> > > If you're working in PySpark you can set up a virtual env and install
> > > the current RC and see if anything important breaks, in the Java/Scala
> > > you can add the staging repository to your projects resolvers and test
> > > with the RC (make sure to clean up the artifact cache before/after so
> > > you don't end up building with a out of date RC going forward).
> > >
> > > ===
> > > What should happen to JIRA tickets still targeting 2.4.1?
> > > ===
> > >
> > > The current list of open tickets targeted at 2.4.1 can be found at:
> > > https://issues.apache.org/jira/projects/SPARK and search for "Target 
> > > Version/s" = 2.4.1
> > >
> > > Committers should look at those and triage. Extremely important bug
> > > fixes, documentation, and API tweaks that impact compatibility should
>

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-22 Thread DB Tsai
Since we have a couple of concerns and hesitations about releasing rc8, how
about we give it a couple of days and have another vote on Monday, March 25?
In that case, I will cut rc9 on Monday morning.

Darcy, as Dongjoon mentioned,
https://github.com/apache/spark/pull/24092 conflicts with
branch-2.4. Can you open another PR against branch-2.4 so we can
include the ORC fix in 2.4.1?

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung  wrote:
>
> Reposting for shane here
>
> [SPARK-27178]
> https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
>
> Is problematic too and it’s not in the rc8 cut
>
> https://github.com/apache/spark/commits/branch-2.4
>
> (Personally I don’t want to delay 2.4.1 either..)
>
> 
> From: Sean Owen 
> Sent: Wednesday, March 20, 2019 11:18 AM
> To: DB Tsai
> Cc: dev
> Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC8)
>
> +1 for this RC. The tag is correct, licenses and sigs check out, tests
> of the source with most profiles enabled works for me.
>
> On Tue, Mar 19, 2019 at 5:28 PM DB Tsai  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version 
> > 2.4.1.
> >
> > The vote is open until March 23 PST and passes if a majority +1 PMC votes 
> > are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.1-rc8 (commit 
> > 746b3ddee6f7ad3464e326228ea226f5b1f39a41):
> > https://github.com/apache/spark/tree/v2.4.1-rc8
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1318/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-docs/
> >
> > The list of bug fixes going into 2.4.1 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with a out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.1?
> > ===
> >
> > The current list of open tickets targeted at 2.4.1 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target 
> > Version/s" = 2.4.1
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> >
> > DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-20 Thread DB Tsai
Unfortunately, for the 2.4.1 RC cuts, we unexpectedly ran into a couple of
critical bug fixes right after each RC was cut, and some of the bugs were
only found after the cut, so it was hard to know beforehand that a blocker
still remained.

How about we start testing RC8 now, given that the differences between
RC8 and 2.4.0 are big? If an issue is found that justifies failing RC8,
we can include SPARK-27112 and SPARK-27160 in the next cut. That way,
even if we decide to cut another RC, it will be easier to test.

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


On Wed, Mar 20, 2019 at 9:48 AM dhruve ashar  wrote:
>
> I agree with Imran on this. Since we are already on rc8, we don't want to 
> hold the release indefinitely for one more fix, but this one is a severe 
> deadlock.
>
> Note to myself and other community members: maybe we can be more proactive 
> in checking and reporting in-progress PR reviews/JIRAs for any blockers as 
> soon as the first RC is cut, so that we don't hold up the release 
> process. I believe in this case the merge commit and the RC were in a very close 
> time frame.
>
> On Wed, Mar 20, 2019 at 11:32 AM Imran Rashid  wrote:
>>
>> Even if only PMC are able to veto a release, I believe all community members 
>> are encouraged to vote, even a -1, to express their opinions, right?
>>
>> I am -0.5 on the release because of SPARK-27112.  It is not a regression, so 
>> in that sense I don't think it must hold the release.  But it is fixing a 
>> pretty bad deadlock.
>>
>> that said, I'm only -0.5 because (a) I don't want to keep holding the 
>> release indefinitely for "one more fix" and (b) this will probably only hit 
>> users running on large clusters -- probably sophisticated enough users to 
>> apply their own set of patches.  I'd prefer we cut another rc with the fix, 
>> but understand the tradeoffs here.
>>
>> On Wed, Mar 20, 2019 at 10:17 AM Sean Owen  wrote:
>>>
>>> Is it a regression from 2.4.0? that's not the only criteria but part of it.
>>> The version link is 
>>> https://issues.apache.org/jira/projects/SPARK/versions/12344117
>>>
>>> On Wed, Mar 20, 2019 at 10:15 AM dhruve ashar  wrote:
>>>>
>>>> A deadlock bug was recently fixed and backported to 2.4, but the rc was 
>>>> cut before that. I think we should include the fix for a critical bug like 
>>>> that in the current rc.
>>>>
>>>> issue: https://issues.apache.org/jira/browse/SPARK-27112
>>>> commit: 
>>>> https://github.com/apache/spark/commit/95e73b328ac883be2ced9099f20c8878e498e297
>>>>
>>>> I am hitting a dead link while checking: 
>>>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>>>> I don't know what the right URL looks like, but we should fix it.
>>>>
>>>> On Wed, Mar 20, 2019 at 9:14 AM Stavros Kontopoulos 
>>>>  wrote:
>>>>>
>>>>> +1  (non-binding)
>>>>>
>>>>> On Wed, Mar 20, 2019 at 8:33 AM Sean Owen  wrote:
>>>>>>
>>>>>> (Only the PMC can veto a release)
>>>>>> That doesn't look like a regression. I get that it's important, but I
>>>>>> don't see that it should block this release.
>>>>>>
>>>>>> On Tue, Mar 19, 2019 at 11:00 PM Darcy Shen  
>>>>>> wrote:
>>>>>> >
>>>>>> > -1
>>>>>> >
>>>>>> > please backpoart SPARK-27160, a correctness issue about ORC native 
>>>>>> > reader.
>>>>>> >
>>>>>> > see https://github.com/apache/spark/pull/24092
>>>>>> >
>>>>>> >
>>>>>> >  On Wed, 20 Mar 2019 06:21:29 +0800 DB Tsai 
>>>>>> >  wrote 
>>>>>> >
>>>>>> > Please vote on releasing the following candidate as Apache Spark 
>>>>>> > version 2.4.1.
>>>>>> >
>>>>>> > The vote is open until March 23 PST and passes if a majority +1 PMC 
>>>>>> > votes are cast, with
>>>>>> > a minimum of 3 +1 votes.
>>>>>> >
>>>>>> > [ ] +1 Release this package as Apache Spark 2.4.1
>

[VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-19 Thread DB Tsai
Please vote on releasing the following candidate as Apache Spark version 2.4.1.

The vote is open until March 23 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.1-rc8 (commit 
746b3ddee6f7ad3464e326228ea226f5b1f39a41):
https://github.com/apache/spark/tree/v2.4.1-rc8

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1318/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-docs/

The list of bug fixes going into 2.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/2.4.1

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload, running it on this release candidate, and then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. On the Java/Scala side,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
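
For the Java/Scala route, a minimal sketch of what that resolver change could
look like in an sbt build is below; the staging URL is the one listed above for
this RC, while the project name and the spark-sql module choice are only
illustrative.

  // build.sbt sketch for smoke-testing the 2.4.1 RC8 staging artifacts.
  // Only the resolver URL comes from this email; everything else is illustrative.
  name := "spark-rc-smoke-test"
  scalaVersion := "2.11.12"
  resolvers += "spark-2.4.1-rc8-staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1318/"
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1"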

===
What should happen to JIRA tickets still targeting 2.4.1?
===

The current list of open tickets targeted at 2.4.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] 2.4.1-rcX release, k8s client PRs, build system infrastructure update

2019-03-14 Thread DB Tsai
Since rc8 was already cut without the k8s client upgrade, the build is
ready to vote on, and including the k8s client upgrade in 2.4.1 implies that
we will drop the old-but-not-that-old
K8S versions, as Sean mentioned. Should we include this upgrade in 2.4.2 instead?

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Thu, Mar 14, 2019 at 9:48 AM shane knapp  wrote:
>
> thanks everyone, both PRs are merged.  :)
>
> On Wed, Mar 13, 2019 at 3:51 PM shane knapp  wrote:
>>
>> btw, let's wait and see if the non-k8s PRB tests pass before merging 
>> https://github.com/apache/spark/pull/23993 in to 2.4.1
>>
>> On Wed, Mar 13, 2019 at 3:42 PM shane knapp  wrote:
>>>
>>> 2.4.1 k8s integration test passed:
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/8875/
>>>
>>> thanks everyone!  :)
>>>
>>> On Wed, Mar 13, 2019 at 3:24 PM shane knapp  wrote:
>>>>
>>>> 2.4.1 integration tests running:  
>>>> https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/8875/
>>>>
>>>> On Wed, Mar 13, 2019 at 3:15 PM shane knapp  wrote:
>>>>>
>>>>> upgrade completed, jenkins building again...  master PR merged, waiting 
>>>>> for the 2.4.1 PR to launch the k8s integration tests.
>>>>>
>>>>> On Wed, Mar 13, 2019 at 2:55 PM shane knapp  wrote:
>>>>>>
>>>>>> okie dokie!  the time approacheth!
>>>>>>
>>>>>> i'll pause jenkins @ 3pm to not accept new jobs.  i don't expect the 
>>>>>> upgrade to take more than 15-20 mins, following which i will re-enable 
>>>>>> builds.
>>>>>>
>>>>>> On Wed, Mar 13, 2019 at 12:17 PM shane knapp  wrote:
>>>>>>>
>>>>>>> ok awesome.  let's shoot for 3pm PST.
>>>>>>>
>>>>>>> On Wed, Mar 13, 2019 at 11:59 AM Marcelo Vanzin  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Wed, Mar 13, 2019 at 11:53 AM shane knapp  
>>>>>>>> wrote:
>>>>>>>> > On Wed, Mar 13, 2019 at 11:49 AM Marcelo Vanzin 
>>>>>>>> >  wrote:
>>>>>>>> >>
>>>>>>>> >> Do the upgraded minikube/k8s versions break the current master 
>>>>>>>> >> client
>>>>>>>> >> version too?
>>>>>>>> >>
>>>>>>>> > yes.
>>>>>>>>
>>>>>>>> Ah, so that part kinda sucks.
>>>>>>>>
>>>>>>>> Let's do this: since the master PR is good to go pending the minikube
>>>>>>>> upgrade, let's try to synchronize things. Set a time to do the
>>>>>>>> minikube upgrade this PM, if that works for you, and I'll merge that
>>>>>>>> PR once it's done. Then I'll take care of backporting it to 2.4 and
>>>>>>>> make sure it passes the integration tests.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Marcelo
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Shane Knapp
>>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>> https://rise.cs.berkeley.edu
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Shane Knapp
>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>> https://rise.cs.berkeley.edu
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Shane Knapp
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>
>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread DB Tsai
As we have many important fixes in the 2.4 branch that we want to release
asap, and this is not a regression from Spark 2.4, 2.4.1
will not be blocked by this.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0359BC9965359766


On Sun, Mar 10, 2019 at 3:08 PM Michael Heuer  wrote:

> Any chance we could get some movement on this for 2.4.1?
>
> https://issues.apache.org/jira/browse/SPARK-25588
> https://github.com/apache/parquet-mr/pull/560
>
> It would require a new Parquet release, which would then need to be picked
> up by Spark.  We're dead in the water on 2.4.0 without a large refactoring
> (remove all the RDD code paths for reading Avro stored in Parquet).
>
>michael
>
>
> On Mar 8, 2019, at 6:22 PM, Sean Owen  wrote:
>
> FWIW RC6 looked fine to me. Passed all tests, etc.
>
> On Fri, Mar 8, 2019 at 6:09 PM DB Tsai  wrote:
>
>> Sounds fair to me. I'll cut another rc7 when the PR is merged. Hopefully,
>> this is the final rc. Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>>
>> On Fri, Mar 8, 2019 at 3:23 PM Xiao Li  wrote:
>>
>>> It is common to hit this issue when the driver and executors have different
>>> object layouts, but Spark might not return a wrong answer. It is very hard
>>> to find out the root cause. Thus, I would suggest including it in Spark
>>> 2.4.1.
>>>
>>> On Fri, Mar 8, 2019 at 3:13 PM DB Tsai  wrote:
>>>
>>>> BTW, practically, is it common for users running into this bug when the
>>>> driver and executors have different object layout?
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> --
>>>> Web: https://www.dbtsai.com
>>>> PGP Key ID: 42E5B25A8F7A82C1
>>>>
>>>>
>>>> On Fri, Mar 8, 2019 at 3:00 PM DB Tsai  wrote:
>>>>
>>>>> Hi Xiao,
>>>>>
>>>>> I already cut rc7 and start the build process. If we definitely need
>>>>> this fix, I can cut rc8. Let me know what you think.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> On Fri, Mar 8, 2019 at 1:46 PM Xiao Li  wrote:
>>>>>
>>>>>> Hi, DB,
>>>>>>
>>>>>> Since this RC will fail, could you hold it until we fix
>>>>>> https://issues.apache.org/jira/browse/SPARK-27097? Either Kris or I
>>>>>> will submit a PR today. The PR is small and the risk is low. This is a
>>>>>> correctness bug. It would be good to have it.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 8, 2019 at 12:17 PM DB Tsai 
>>>>>> wrote:
>>>>>>
>>>>>>> Since I can not find the commit of `Preparing development version
>>>>>>> 2.4.2-SNAPSHOT` after rc6 cut, it's very risky to fix the branch and do 
>>>>>>> a
>>>>>>> force-push. I'll follow Marcelo's suggestion to have another rc7 cut. 
>>>>>>> Thus,
>>>>>>> this vote fails.
>>>>>>>
>>>>>>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
>>>>>>> Apple, Inc
>>>>>>>
>>>>>>> > On Mar 8, 2019, at 11:45 AM, DB Tsai 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Okay, I see the problem. rc6 tag is not in the 2.4 branch. It's
>>>>>>> very weird. It must be overwritten by a force push.
>>>>>>> >
>>>>>>> > DB Tsai  |  Siri Open Source Technologies [not a contribution]  |
>>>>>>>  Apple, Inc
>>>>>>> >
>>>>>>> >> On Mar 8, 2019, at 11:39 AM, DB Tsai 
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> I was using `./do-release-docker.sh` to create a release. But
>>>>>>> since the gpg validation fails couple times when the script tried to
>>>>>>> publish the jars into Nexus, I re-ran the 

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-08 Thread DB Tsai
Since I cannot find the commit `Preparing development version 
2.4.2-SNAPSHOT` after the rc6 cut, it's very risky to fix the branch and do a 
force-push. I'll follow Marcelo's suggestion and cut another rc, rc7. Thus, 
this vote fails.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Mar 8, 2019, at 11:45 AM, DB Tsai  wrote:
> 
> Okay, I see the problem. rc6 tag is not in the 2.4 branch. It's very weird. 
> It must be overwritten by a force push.
> 
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
> Inc
> 
>> On Mar 8, 2019, at 11:39 AM, DB Tsai  wrote:
>> 
>> I was using `./do-release-docker.sh` to create a release. But since the gpg 
>> validation fails couple times when the script tried to publish the jars into 
>> Nexus, I re-ran the scripts multiple times without creating a new rc. I was 
>> wondering if the script will overwrite the v.2.4.1-rc6 tag instead of using 
>> the same commit causing this issue.
>> 
>> Should we create a new rc7?
>> 
>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
>> Inc
>> 
>>> On Mar 8, 2019, at 10:54 AM, Marcelo Vanzin  
>>> wrote:
>>> 
>>> I personally find it a little weird to not have the commit in branch-2.4.
>>> 
>>> Not that this would happen, but if the v2.4.1-rc6 tag is overwritten
>>> (e.g. accidentally) then you lose the reference to that commit, and
>>> then the exact commit from which the rc was generated is lost.
>>> 
>>> On Fri, Mar 8, 2019 at 7:49 AM Sean Owen  wrote:
>>>> 
>>>> That's weird. I see the commit but can't find it in the branch. Was it 
>>>> pushed, or lost in a force push of 2.4 along the way? The change is there, 
>>>> just under a different commit in the 2.4 branch.
>>>> 
>>>> It doesn't necessarily invalidate the RC as it is a valid public tagged 
>>>> commit and all that. I just want to be sure we do have the code from that 
>>>> commit in these tarballs. It looks like it.
>>>> 
>>>> On Fri, Mar 8, 2019, 4:14 AM Mihály Tóth  wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am not sure how problematic it is but v2.4.1-rc6 is not on branch-2.4. 
>>>>> Release related commits I have seen so far were also part of the branch.
>>>>> 
>>>>> I guess the "Preparing Spark release v2.4.1-rc6" and "Preparing 
>>>>> development version 2.4.2-SNAPSHOT" commits were simply not pushed to 
>>>>> spark-2.4 just the tag itself was pushed. I dont know what is the 
>>>>> practice in such cases but one solution is to rebase branch-2.4 changes 
>>>>> after 3336a21 onto these commits and do a (sorry) force push. In this 
>>>>> case there is no impact on this RC.
>>>>> 
>>>>> Best Regards,
>>>>> 
>>>>> Misi
>>>>> 
>>>>> DB Tsai  ezt írta (időpont: 2019. márc. 8., P, 
>>>>> 1:15):
>>>>>> 
>>>>>> Please vote on releasing the following candidate as Apache Spark version 
>>>>>> 2.4.1.
>>>>>> 
>>>>>> The vote is open until March 11 PST and passes if a majority +1 PMC 
>>>>>> votes are cast, with
>>>>>> a minimum of 3 +1 votes.
>>>>>> 
>>>>>> [ ] +1 Release this package as Apache Spark 2.4.1
>>>>>> [ ] -1 Do not release this package because ...
>>>>>> 
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>> 
>>>>>> The tag to be voted on is v2.4.1-rc6 (commit 
>>>>>> 201ec8c9b46f9d037cc2e3a5d9c896b9840ca1bc):
>>>>>> https://github.com/apache/spark/tree/v2.4.1-rc6
>>>>>> 
>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc6-bin/
>>>>>> 
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>> 
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1308/
>>>>>> 
>>>>>

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-08 Thread DB Tsai
Okay, I see the problem. The rc6 tag is not in the 2.4 branch. It's very weird. It 
must have been overwritten by a force push.
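
One way to confirm this locally is a reachability check of the rc6 tag commit
against branch-2.4. The sketch below (Scala, shelling out to git) is only
illustrative and assumes a local clone where the v2.4.1-rc6 tag and branch-2.4
have been fetched.

  // Sketch: check whether the commit behind v2.4.1-rc6 is reachable from branch-2.4.
  // Assumes git is on the PATH; tag and branch names are taken from this thread.
  import scala.sys.process._

  object CheckTagOnBranch extends App {
    val tagCommit = "git rev-parse v2.4.1-rc6^{commit}".!!.trim
    // merge-base --is-ancestor exits 0 iff the tag commit is an ancestor of branch-2.4.
    val reachable =
      Seq("git", "merge-base", "--is-ancestor", tagCommit, "origin/branch-2.4").! == 0
    println(s"v2.4.1-rc6 ($tagCommit) reachable from branch-2.4: $reachable")
  }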

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Mar 8, 2019, at 11:39 AM, DB Tsai  wrote:
> 
> I was using `./do-release-docker.sh` to create a release. But since the gpg 
> validation fails couple times when the script tried to publish the jars into 
> Nexus, I re-ran the scripts multiple times without creating a new rc. I was 
> wondering if the script will overwrite the v.2.4.1-rc6 tag instead of using 
> the same commit causing this issue.
> 
> Should we create a new rc7?
> 
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
> Inc
> 
>> On Mar 8, 2019, at 10:54 AM, Marcelo Vanzin  
>> wrote:
>> 
>> I personally find it a little weird to not have the commit in branch-2.4.
>> 
>> Not that this would happen, but if the v2.4.1-rc6 tag is overwritten
>> (e.g. accidentally) then you lose the reference to that commit, and
>> then the exact commit from which the rc was generated is lost.
>> 
>> On Fri, Mar 8, 2019 at 7:49 AM Sean Owen  wrote:
>>> 
>>> That's weird. I see the commit but can't find it in the branch. Was it 
>>> pushed, or lost in a force push of 2.4 along the way? The change is there, 
>>> just under a different commit in the 2.4 branch.
>>> 
>>> It doesn't necessarily invalidate the RC as it is a valid public tagged 
>>> commit and all that. I just want to be sure we do have the code from that 
>>> commit in these tarballs. It looks like it.
>>> 
>>> On Fri, Mar 8, 2019, 4:14 AM Mihály Tóth  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I am not sure how problematic it is but v2.4.1-rc6 is not on branch-2.4. 
>>>> Release related commits I have seen so far were also part of the branch.
>>>> 
>>>> I guess the "Preparing Spark release v2.4.1-rc6" and "Preparing 
>>>> development version 2.4.2-SNAPSHOT" commits were simply not pushed to 
>>>> spark-2.4 just the tag itself was pushed. I dont know what is the practice 
>>>> in such cases but one solution is to rebase branch-2.4 changes after 
>>>> 3336a21 onto these commits and do a (sorry) force push. In this case there 
>>>> is no impact on this RC.
>>>> 
>>>> Best Regards,
>>>> 
>>>> Misi
>>>> 
>>>> DB Tsai  ezt írta (időpont: 2019. márc. 8., P, 
>>>> 1:15):
>>>>> 
>>>>> Please vote on releasing the following candidate as Apache Spark version 
>>>>> 2.4.1.
>>>>> 
>>>>> The vote is open until March 11 PST and passes if a majority +1 PMC votes 
>>>>> are cast, with
>>>>> a minimum of 3 +1 votes.
>>>>> 
>>>>> [ ] +1 Release this package as Apache Spark 2.4.1
>>>>> [ ] -1 Do not release this package because ...
>>>>> 
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>> 
>>>>> The tag to be voted on is v2.4.1-rc6 (commit 
>>>>> 201ec8c9b46f9d037cc2e3a5d9c896b9840ca1bc):
>>>>> https://github.com/apache/spark/tree/v2.4.1-rc6
>>>>> 
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc6-bin/
>>>>> 
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>> 
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1308/
>>>>> 
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc6-docs/
>>>>> 
>>>>> The list of bug fixes going into 2.4.1 can be found at the following URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>>>>> 
>>>>> FAQ
>>>>> 
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>> 
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and runn

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-08 Thread DB Tsai
I was using `./do-release-docker.sh` to create the release. But since the gpg 
validation failed a couple of times when the script tried to publish the jars into 
Nexus, I re-ran the script multiple times without creating a new rc. I was 
wondering if the script overwrote the v2.4.1-rc6 tag instead of reusing the 
same commit, causing this issue.

Should we create a new rc7?

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Mar 8, 2019, at 10:54 AM, Marcelo Vanzin  
> wrote:
> 
> I personally find it a little weird to not have the commit in branch-2.4.
> 
> Not that this would happen, but if the v2.4.1-rc6 tag is overwritten
> (e.g. accidentally) then you lose the reference to that commit, and
> then the exact commit from which the rc was generated is lost.
> 
> On Fri, Mar 8, 2019 at 7:49 AM Sean Owen  wrote:
>> 
>> That's weird. I see the commit but can't find it in the branch. Was it 
>> pushed, or lost in a force push of 2.4 along the way? The change is there, 
>> just under a different commit in the 2.4 branch.
>> 
>> It doesn't necessarily invalidate the RC as it is a valid public tagged 
>> commit and all that. I just want to be sure we do have the code from that 
>> commit in these tarballs. It looks like it.
>> 
>> On Fri, Mar 8, 2019, 4:14 AM Mihály Tóth  wrote:
>>> 
>>> Hi,
>>> 
>>> I am not sure how problematic it is but v2.4.1-rc6 is not on branch-2.4. 
>>> Release related commits I have seen so far were also part of the branch.
>>> 
>>> I guess the "Preparing Spark release v2.4.1-rc6" and "Preparing development 
>>> version 2.4.2-SNAPSHOT" commits were simply not pushed to spark-2.4 just 
>>> the tag itself was pushed. I dont know what is the practice in such cases 
>>> but one solution is to rebase branch-2.4 changes after 3336a21 onto these 
>>> commits and do a (sorry) force push. In this case there is no impact on 
>>> this RC.
>>> 
>>> Best Regards,
>>> 
>>> Misi
>>> 
>>> DB Tsai  ezt írta (időpont: 2019. márc. 8., P, 
>>> 1:15):
>>>> 
>>>> Please vote on releasing the following candidate as Apache Spark version 
>>>> 2.4.1.
>>>> 
>>>> The vote is open until March 11 PST and passes if a majority +1 PMC votes 
>>>> are cast, with
>>>> a minimum of 3 +1 votes.
>>>> 
>>>> [ ] +1 Release this package as Apache Spark 2.4.1
>>>> [ ] -1 Do not release this package because ...
>>>> 
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>> 
>>>> The tag to be voted on is v2.4.1-rc6 (commit 
>>>> 201ec8c9b46f9d037cc2e3a5d9c896b9840ca1bc):
>>>> https://github.com/apache/spark/tree/v2.4.1-rc6
>>>> 
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc6-bin/
>>>> 
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>> 
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1308/
>>>> 
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc6-docs/
>>>> 
>>>> The list of bug fixes going into 2.4.1 can be found at the following URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>>>> 
>>>> FAQ
>>>> 
>>>> =
>>>> How can I help test this release?
>>>> =
>>>> 
>>>> If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>> 
>>>> If you're working in PySpark you can set up a virtual env and install
>>>> the current RC and see if anything important breaks, in the Java/Scala
>>>> you can add the staging repository to your projects resolvers and test
>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>> you don't end up building with a out of date RC going forward).
>>>> 
>>>> ===
>>>> What should happen to JIRA tick

[VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-07 Thread DB Tsai
Please vote on releasing the following candidate as Apache Spark version 2.4.1.

The vote is open until March 11 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.1-rc6 (commit 
201ec8c9b46f9d037cc2e3a5d9c896b9840ca1bc):
https://github.com/apache/spark/tree/v2.4.1-rc6

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc6-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1308/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc6-docs/

The list of bug fixes going into 2.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/2.4.1

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload, running it on this release candidate, and then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. On the Java/Scala side,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.1?
===

The current list of open tickets targeted at 2.4.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread DB Tsai
Saisai,

There is no blocker now. I ran into some difficulties in publishing the
jars into Nexus. The publish task was finished, but Nexus gave me the
following error.


failureMessage: Failed to validate the pgp signature of
'/org/apache/spark/spark-streaming-flume-assembly_2.11/2.4.1/spark-streaming-flume-assembly_2.11-2.4.1-tests.jar',
check the logs.

I am sure my key is in the key server, and the weird thing is that it failed
on different jars each time I ran the publish script.
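
For reference, a hedged sketch of checking one of those signatures by hand is
below (Scala, shelling out to gpg); it assumes gpg is installed and that the
KEYS file plus the jar/.asc pair have already been downloaded into the working
directory. The jar name is the one from the failure message above.

  // Sketch: verify a detached .asc signature against the keys in the Spark KEYS file.
  import scala.sys.process._

  object VerifySignature extends App {
    val jar = "spark-streaming-flume-assembly_2.11-2.4.1-tests.jar"
    "gpg --import KEYS".!  // import the release manager keys first
    val ok = Seq("gpg", "--verify", s"$jar.asc", jar).! == 0
    println(s"signature for $jar verifies: $ok")
  }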

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


On Wed, Mar 6, 2019 at 6:04 PM Saisai Shao  wrote:

> Do we have other blocker/critical issues for Spark 2.4.1, or are we waiting for
> something to be fixed? I roughly searched JIRA, and it seems there are no
> blocker/critical issues marked for 2.4.1.
>
> Thanks
> Saisai
>
> shane knapp  于2019年3月7日周四 上午4:57写道:
>
>> i'll be popping in to the sig-big-data meeting on the 20th to talk about
>> stuff like this.
>>
>> On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> Yes, it's a tough decision, and as we discussed today (
>>> https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA
>>> )
>>> "Kubernetes support window is 9 months, Spark is two years". So we may
>>> end up with old client versions on still-supported branches like 2.4.x in
>>> the future.
>>> That gives us no choice but to upgrade if we want to be on the safe
>>> side. We have tested 3.0.0 with 1.11 internally and it works, but I don't
>>> know what it means to run with old
>>> clients.
>>>
>>>
>>> On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:
>>>
>>>> If the old client is basically unusable with the versions of K8S
>>>> people mostly use now, and the new client still works with older
>>>> versions, I could see including this in 2.4.1.
>>>>
>>>> Looking at
>>>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix
>>>> it seems like the 4.1.1 client is needed for 1.10 and above. However
>>>> it no longer supports 1.7 and below.
>>>> We have 3.0.x, and versions through 4.0.x of the client support the
>>>> same K8S versions, so no real middle ground here.
>>>>
>>>> 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
>>>> branches are maintained for 9 months per
>>>> https://kubernetes.io/docs/setup/version-skew-policy/
>>>>
>>>> Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
>>>> used the newer client from the start as at that point (?) 1.7 and
>>>> earlier were already at least 7 months past EOL.
>>>> If we update the client in 2.4.1, versions of K8S as recently
>>>> 'supported' as a year ago won't work anymore. I'm guessing there are
>>>> still 1.7 users out there? That wasn't that long ago but if the
>>>> project and users generally move fast, maybe not.
>>>>
>>>> Normally I'd say, that's what the next minor release of Spark is for;
>>>> update if you want later infra. But there is no Spark 2.5.
>>>> I presume downstream distros could modify the dependency easily (?) if
>>>> needed and maybe already do. It wouldn't necessarily help end users.
>>>>
>>>> Does the 3.0.x client not work at all with 1.10+, or is it just unsupported?
>>>> If it 'basically works but no guarantees' I'd favor not updating. If
>>>> it doesn't work at all, hm. That's tough. I think I'd favor updating
>>>> the client but think it's a tough call both ways.
>>>>
>>>>
>>>>
>>>> On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
>>>>  wrote:
>>>> >
>>>> > Yes, Shane Knapp has done the work for that already, and the tests also
>>>> > pass. I am working on a PR now; I could submit it for the 2.4 branch.
>>>> > I understand that this is a major dependency update, but the problem
>>>> > I see is that the client version is so old that I don't think it makes
>>>> > much sense for current users who are on k8s 1.10, 1.11, etc. (
>>>> > https://github.com/fabric8io/kubernetes-client#compatibility-matrix,
>>>> > 3.0.0 does not even exist in there).
>>>> > I don't know what it means to use that old version with current k8s
>>>> > clusters in terms of bugs etc.
>>>>
>>>
>>>
>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread DB Tsai
I am cutting a new rc4 with fix from Felix. Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0359BC9965359766

On Thu, Feb 21, 2019 at 8:57 AM Felix Cheung  wrote:
>
> I merged the fix to 2.4.
>
>
> 
> From: Felix Cheung 
> Sent: Wednesday, February 20, 2019 9:34 PM
> To: DB Tsai; Spark dev list
> Cc: Cesar Delgado
> Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>
> Could you hold for a bit - I have one more fix to get in
>
>
> ________
> From: d_t...@apple.com on behalf of DB Tsai 
> Sent: Wednesday, February 20, 2019 12:25 PM
> To: Spark dev list
> Cc: Cesar Delgado
> Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>
> Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.
>
> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
>
> > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin  
> > wrote:
> >
> > Just wanted to point out that
> > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> > and is marked as a correctness bug. (The fix is in the 2.4 branch,
> > just not in rc2.)
> >
> > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
> >>
> >> Please vote on releasing the following candidate as Apache Spark version 
> >> 2.4.1.
> >>
> >> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes 
> >> are cast, with
> >> a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 2.4.1
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v2.4.1-rc2 (commit 
> >> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
> >> https://github.com/apache/spark/tree/v2.4.1-rc2
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1299/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
> >>
> >> The list of bug fixes going into 2.4.1 can be found at the following URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks, in the Java/Scala
> >> you can add the staging repository to your projects resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with a out of date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 2.4.1?
> >> ===
> >>
> >> The current list of open tickets targeted at 2.4.1 can be found at:
> >> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> >> Version/s" = 2.4.1
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >>
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >>
> >> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
> >>
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> > --
> > Marcelo
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-20 Thread DB Tsai
Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin  
> wrote:
> 
> Just wanted to point out that
> https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> and is marked as a correctness bug. (The fix is in the 2.4 branch,
> just not in rc2.)
> 
> On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
>> 
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.4.1.
>> 
>> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are 
>> cast, with
>> a minimum of 3 +1 votes.
>> 
>> [ ] +1 Release this package as Apache Spark 2.4.1
>> [ ] -1 Do not release this package because ...
>> 
>> To learn more about Apache Spark, please see http://spark.apache.org/
>> 
>> The tag to be voted on is v2.4.1-rc2 (commit 
>> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> https://github.com/apache/spark/tree/v2.4.1-rc2
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>> 
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1299/
>> 
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>> 
>> The list of bug fixes going into 2.4.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>> 
>> FAQ
>> 
>> =
>> How can I help test this release?
>> =
>> 
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>> 
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>> 
>> ===
>> What should happen to JIRA tickets still targeting 2.4.1?
>> ===
>> 
>> The current list of open tickets targeted at 2.4.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 2.4.1
>> 
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>> 
>> ==
>> But my bug isn't fixed?
>> ==
>> 
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>> 
>> 
>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
>> Inc
>> 
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
> 
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-20 Thread DB Tsai
Please vote on releasing the following candidate as Apache Spark version 2.4.1.

The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.1-rc2 (commit 
229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
https://github.com/apache/spark/tree/v2.4.1-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1299/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/

The list of bug fixes going into 2.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/2.4.1

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload, running it on this release candidate, and then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. On the Java/Scala side,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.1?
===

The current list of open tickets targeted at 2.4.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Time to cut an Apache 2.4.1 release?

2019-02-12 Thread DB Tsai
Great. I'll prepare the release for voting. Thanks!

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Feb 12, 2019, at 4:11 AM, Wenchen Fan  wrote:
> 
> +1 for 2.4.1
> 
> On Tue, Feb 12, 2019 at 7:55 PM Hyukjin Kwon  wrote:
> +1 for 2.4.1
> 
> 2019년 2월 12일 (화) 오후 4:56, Dongjin Lee 님이 작성:
> > SPARK-23539 is a non-trivial improvement, so probably would not be 
> > back-ported to 2.4.x.
> 
> Got it. It seems reasonable.
> 
> Committers:
> 
> Please don't omit SPARK-23539 from 2.5.0. Kafka community needs this feature.
> 
> Thanks,
> Dongjin
> 
> On Tue, Feb 12, 2019 at 1:50 PM Takeshi Yamamuro  
> wrote:
> +1, too.
> branch-2.4 accumulates too many commits..:
> https://github.com/apache/spark/compare/0a4c03f7d084f1d2aa48673b99f3b9496893ce8d...af3c7111efd22907976fc8bbd7810fe3cfd92092
> 
> On Tue, Feb 12, 2019 at 12:36 PM Dongjoon Hyun  wrote:
> Thank you, DB.
> 
> +1, Yes. It's time for preparing 2.4.1 release.
> 
> Bests,
> Dongjoon.
> 
> On 2019/02/12 03:16:05, Sean Owen  wrote: 
> > I support a 2.4.1 release now, yes.
> > 
> > SPARK-23539 is a non-trivial improvement, so probably would not be
> > back-ported to 2.4.x. SPARK-26154 does look like a bug whose fix could
> > be back-ported, but that's a big change. I wouldn't hold up 2.4.1 for
> > it, but it could go in if otherwise ready.
> > 
> > 
> > On Mon, Feb 11, 2019 at 5:20 PM Dongjin Lee  wrote:
> > >
> > > Hi DB,
> > >
> > > Could you add SPARK-23539[^1] into 2.4.1? I opened the PR[^2] a little 
> > > bit ago, but it has not included in 2.3.0 nor get enough review.
> > >
> > > Thanks,
> > > Dongjin
> > >
> > > [^1]: https://issues.apache.org/jira/browse/SPARK-23539
> > > [^2]: https://github.com/apache/spark/pull/22282
> > >
> > > On Tue, Feb 12, 2019 at 6:28 AM Jungtaek Lim  wrote:
> > >>
> > >> Given SPARK-26154 [1] is a correctness issue and PR [2] is submitted, I 
> > >> hope it can be reviewed and included within Spark 2.4.1 - otherwise it 
> > >> will be a long-lived correctness issue.
> > >>
> > >> Thanks,
> > >> Jungtaek Lim (HeartSaVioR)
> > >>
> > >> 1. https://issues.apache.org/jira/browse/SPARK-26154
> > >> 2. https://github.com/apache/spark/pull/23634
> > >>
> > >>
> > >> 2019년 2월 12일 (화) 오전 6:17, DB Tsai 님이 작성:
> > >>>
> > >>> Hello all,
> > >>>
> > >>> I am preparing to cut a new Apache 2.4.1 release as there are many bugs 
> > >>> and correctness issues fixed in branch-2.4.
> > >>>
> > >>> The list of addressed issues are 
> > >>> https://issues.apache.org/jira/browse/SPARK-26583?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.4.1%20order%20by%20updated%20DESC
> > >>>
> > >>> Let me know if you have any concern or any PR you would like to get in.
> > >>>
> > >>> Thanks!
> > >>>
> > >>> -
> > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>>
> > >
> > >
> > > --
> > > Dongjin Lee
> > >
> > > A hitchhiker in the mathematical world.
> > >
> > > github: github.com/dongjinleekr
> > > linkedin: kr.linkedin.com/in/dongjinleekr
> > > speakerdeck: speakerdeck.com/dongjin
> > 
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > 
> > 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
> -- 
> Dongjin Lee
> 
> A hitchhiker in the mathematical world.
> 
> github: github.com/dongjinleekr
> linkedin: kr.linkedin.com/in/dongjinleekr
> speakerdeck: speakerdeck.com/dongjin


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Time to cut an Apache 2.4.1 release?

2019-02-11 Thread DB Tsai
Hello all,

I am preparing to cut a new Apache Spark 2.4.1 release, as there are many bugs and 
correctness issues fixed in branch-2.4.

The list of addressed issues is 
https://issues.apache.org/jira/browse/SPARK-26583?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.4.1%20order%20by%20updated%20DESC

Let me know if you have any concern or any PR you would like to get in.

Thanks!

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

2019-01-23 Thread DB Tsai
-1

Agreed with Anton that this bug will potentially corrupt the data
silently. As he is ready to submit a PR, I suggest we wait to
include the fix. Thanks!

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0

On Wed, Jan 23, 2019 at 7:10 AM Anton Okolnychyi
 wrote:
>
> It is a correctness bug. I have updated the description with an example. It 
> has been there for a while, so I am not sure about the priority.
>
> ср, 23 янв. 2019 г. в 14:48, Sean Owen :
>>
>> I'm not clear if it's a correctness bug from that description, and if
>> it's not a regression, no it does not need to go into 2.3.3. If it's a
>> real bug, sure it can be merged to 2.3.x.
>>
>> On Wed, Jan 23, 2019 at 7:54 AM Anton Okolnychyi
>>  wrote:
>> >
>> > Recently, I came across this bug: 
>> > https://issues.apache.org/jira/browse/SPARK-26706.
>> >
>> > It seems appropriate to include it in 2.3.3, doesn't it?
>> >
>> > Thanks,
>> > Anton
>> >

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-08 Thread DB Tsai
+1

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0

On Tue, Jan 8, 2019 at 11:14 AM Dongjoon Hyun  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.2.3.
>
> The vote is open until January 11 11:30AM (PST) and passes if a majority +1 
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.2.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.3-rc1 (commit 
> 4acb6ba37b94b90aac445e6546426145a5f9eba2):
> https://github.com/apache/spark/tree/v2.2.3-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.2.3-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1295
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.2.3-rc1-docs/
>
> The list of bug fixes going into 2.2.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12343560
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks. In the Java/Scala
> world, you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.2.3?
> ===
>
> The current list of open tickets targeted at 2.2.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.2.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Automated formatting

2018-11-21 Thread DB Tsai
I like the idea of checking only the diff. Even I am sometimes confused
about the right style in Spark since I am working on multiple projects with
slightly different coding styles.

On Wed, Nov 21, 2018 at 1:36 PM Sean Owen  wrote:

> I know the PR builder runs SBT, but I presume this would just be a
> separate mvn job that runs. If it doesn't take long and only checks
> the right diff, seems worth a shot. What's the invocation that Shane
> could add (after this change goes in)
> On Wed, Nov 21, 2018 at 3:27 PM Cody Koeninger  wrote:
> >
> > There's a mvn plugin (sbt as well, but it requires sbt 1.0+) so it
> > should be runnable from the PR builder
> >
> > Super basic example with a minimal config that's close to current
> > style guide here:
> >
> > https://github.com/apache/spark/compare/master...koeninger:scalafmt
> >
> > I imagine tracking down the corner cases in the config, especially
> > around interactions with scalastyle, may take a bit of work.  Happy to
> > do it, but not if there's significant concern about style related
> > changes in PRs.
> > On Wed, Nov 21, 2018 at 2:42 PM Sean Owen  wrote:
> > >
> > > Yeah fair, maybe mostly consistent in broad strokes but not in the
> details.
> > > Is this something that can be just run in the PR builder? if the rules
> > > are simple and not too hard to maintain, seems like a win.
> > > On Wed, Nov 21, 2018 at 2:26 PM Cody Koeninger 
> wrote:
> > > >
> > > > Definitely not suggesting a mass reformat, just on a per-PR basis.
> > > >
> > > > scalafmt --diff  will reformat only the files that differ from git
> head
> > > > scalafmt --test --diff won't modify files, just throw an exception if
> > > > they don't match format
> > > >
> > > > I don't think code is consistently formatted now.
> > > > I tried scalafmt on the most recent PR I looked at, and it caught
> > > > stuff as basic as newlines before curly brace in existing code.
> > > > I've had different reviewers for PRs that were literal backports or
> > > > cut & paste of each other come up with different formatting nits.
> > > >
> > > >
> > > > On Wed, Nov 21, 2018 at 12:03 PM Sean Owen  wrote:
> > > > >
> > > > > I think reformatting the whole code base might be too much. If
> there
> > > > > are some more targeted cleanups, sure. We do have some links to
> style
> > > > > guides buried somewhere in the docs, although the conventions are
> > > > > pretty industry standard.
> > > > >
> > > > > I *think* the code is pretty consistently formatted now, and would
> > > > > expect contributors to follow formatting they see, so ideally the
> > > > > surrounding code alone is enough to give people guidance. In
> practice,
> > > > > we're always going to have people format differently no matter
> what I
> > > > > think so it's inevitable.
> > > > >
> > > > > Is there a way to just check style on PR changes? that's fine.
> > > > > On Wed, Nov 21, 2018 at 11:40 AM Cody Koeninger <
> c...@koeninger.org> wrote:
> > > > > >
> > > > > > Is there any appetite for revisiting automating formatting?
> > > > > >
> > > > > > I know over the years various people have expressed opposition
> to it
> > > > > > as unnecessary churn in diffs, but having every new contributor
> > > > > > greeted with "nit: 4 space indentation for argument lists" isn't
> very
> > > > > > welcoming.
> > > > > >
> > > > > >
> -
> > > > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > > > >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
- DB Sent from my iPhone


Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-21 Thread DB Tsai
+1 on removing Scala 2.11 support for 3.0, given that Scala 2.11 is already EOL.

On Tue, Nov 20, 2018 at 2:53 PM Sean Owen  wrote:

> PS: pull request at https://github.com/apache/spark/pull/23098
> Not going to merge it until there's clear agreement.
>
> On Tue, Nov 20, 2018 at 10:16 AM Ryan Blue  wrote:
> >
> > +1 to removing 2.11 support for 3.0 and a PR.
> >
> > It sounds like having multiple Scala builds is just not feasible and I
> don't think this will be too disruptive for users since it is already a
> breaking change.
> >
> > On Tue, Nov 20, 2018 at 7:05 AM Sean Owen  wrote:
> >>
> >> One more data point -- from looking at the SBT build yesterday, it
> >> seems like most plugin updates require SBT 1.x. And both they and SBT
> >> 1.x seem to need Scala 2.12. And the new zinc also does.
> >> Now, the current SBT and zinc and plugins all appear to work OK with
> >> 2.12 now, but updating will pretty much have to wait until 2.11
> >> support goes. (I don't think it's feasible to have two SBT builds.)
> >>
> >> I actually haven't heard an argument for keeping 2.11, compared to the
> >> overhead of maintaining it. Any substantive objections? Would it be
> >> too forward to put out a WIP PR that removes it?
> >>
> >> On Sat, Nov 17, 2018 at 7:28 PM Sean Owen  wrote:
> >> >
> >> > I support dropping 2.11 support. My general logic is:
> >> >
> >> > - 2.11 is EOL, and is all the more EOL in the middle of next year when
> >> > Spark 3 arrives
> >> > - I haven't heard of a critical dependency that has no 2.12
> counterpart
> >> > - 2.11 users can stay on 2.4.x, which will be notionally supported
> >> > through, say, end of 2019
> >> > - Maintaining 2.11 vs 2.12 support is modestly difficult, in my
> >> > experience resolving these differences across these two versions; it's
> >> > a hassle as you need two git clones with different scala versions in
> >> > the project tags
> >> > - The project is already short on resources to support things as it is
> >> > - Dropping things is generally necessary to add new things, to keep
> >> > complexity reasonable -- like Scala 2.13 support
> >> >
> >> > Maintaining a separate PR builder for 2.11 isn't so bad
> >> >
> >> > On Fri, Nov 16, 2018 at 4:09 PM Marcelo Vanzin
> >> >  wrote:
> >> > >
> >> > > Now that the switch to 2.12 by default has been made, it might be
> good
> >> > > to have a serious discussion about dropping 2.11 altogether. Many of
> >> > > the main arguments have already been talked about. But I don't
> >> > > remember anyone mentioning how easy it would be to break the 2.11
> >> > > build now.
> >> > >
> >> > > For example, the following works fine in 2.12 but breaks in 2.11:
> >> > >
> >> > > java.util.Arrays.asList("hi").stream().forEach(println)
> >> > >
> >> > > We had a similar issue when we supported java 1.6 but the builds
> were
> >> > > all on 1.7 by default. Every once in a while something would
> silently
> >> > > break, because PR builds only check the default. And the jenkins
> >> > > builds, which are less monitored, would stay broken for a while.
> >> > >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
- DB Sent from my iPhone


Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-16 Thread DB Tsai
Most of the time in the PR build is spent running tests. How about we
also add Scala 2.11 compilation for both main and test, without running
the tests, in the PR build?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0

On Fri, Nov 16, 2018 at 10:09 PM Marcelo Vanzin
 wrote:
>
> Now that the switch to 2.12 by default has been made, it might be good
> to have a serious discussion about dropping 2.11 altogether. Many of
> the main arguments have already been talked about. But I don't
> remember anyone mentioning how easy it would be to break the 2.11
> build now.
>
> For example, the following works fine in 2.12 but breaks in 2.11:
>
> java.util.Arrays.asList("hi").stream().forEach(println)
>
> We had a similar issue when we supported java 1.6 but the builds were
> all on 1.7 by default. Every once in a while something would silently
> break, because PR builds only check the default. And the jenkins
> builds, which are less monitored, would stay broken for a while.
>
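For context on why that one-liner compiles only under 2.12: Scala 2.12 can turn
a Scala lambda into a Java SAM interface (here java.util.function.Consumer),
while Scala 2.11 cannot, so a 2.11 build needs an explicit Consumer. A minimal
sketch of a 2.11-friendly form (illustrative only; not part of the original
message):

    // Compiles on both Scala 2.11 and 2.12: wrap the action in an explicit
    // java.util.function.Consumer instead of relying on SAM conversion.
    import java.util.function.Consumer
    java.util.Arrays.asList("hi").stream().forEach(new Consumer[String] {
      override def accept(s: String): Unit = println(s)
    })
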
> On Tue, Nov 6, 2018 at 11:13 AM DB Tsai  wrote:
> >
> > We made Scala 2.11 the default Scala version in Spark 2.0. Now, the next 
> > Spark version will be 3.0, so it's a great time to discuss whether we should make 
> > Scala 2.12 the default Scala version in Spark 3.0.
> >
> > Scala 2.11 is EOL, and it came out 4.5 years ago; as a result, it's unlikely to 
> > support JDK 11 in Scala 2.11 unless we're willing to sponsor the needed 
> > work per the discussion in the Scala community, 
> > https://github.com/scala/scala-dev/issues/559#issuecomment-436160166
> >
> > We have initial support for Scala 2.12 in Spark 2.4. If we decide to make 
> > Scala 2.12 the default for Spark 3.0 now, we will have ample time to work on 
> > bugs and issues that we may run into.
> >
> > What do you think?
> >
> > Thanks,
> >
> > DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
> > Inc
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-08 Thread DB Tsai
Based on the discussions, I created a PR that makes Spark's default
Scala version 2.12, with Scala 2.11 as the alternative
version. This implies that Scala 2.12 will be used by our CI builds,
including pull request builds.

https://github.com/apache/spark/pull/22967

We can decide later if we want to change the alternative Scala version
to 2.13 and drop 2.11, if we only want to support two Scala versions at
a time.

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0
On Wed, Nov 7, 2018 at 11:18 AM Sean Owen  wrote:
>
> It's not making 2.12 the default, but not dropping 2.11. Supporting
> 2.13 could mean supporting 3 Scala versions at once, which I claim is
> just too much. I think the options are likely:
>
> - Support 2.11, 2.12 in Spark 3.0. Deprecate 2.11 and make 2.12 the
> default. Add 2.13 support in 3.x and drop 2.11 in the same release
> - Deprecate 2.11 right now via announcement and/or Spark 2.4.1 soon.
> Drop 2.11 support in Spark 3.0, and support only 2.12.
> - (same as above, but add Spark 2.13 support if possible for Spark 3.0)
>
>
> On Wed, Nov 7, 2018 at 12:32 PM Mark Hamstra  wrote:
> >
> > I'm not following "exclude Scala 2.13". Is there something inherent in 
> > making 2.12 the default Scala version in Spark 3.0 that would prevent us 
> > from supporting the option of building with 2.13?
> >
> > On Tue, Nov 6, 2018 at 5:48 PM Sean Owen  wrote:
> >>
> >> That's possible here, sure. The issue is: would you exclude Scala 2.13
> >> support in 3.0 for this, if it were otherwise ready to go?
> >> I think it's not a hard rule that something has to be deprecated
> >> previously to be removed in a major release. The notice is helpful,
> >> sure, but there are lots of ways to provide that notice to end users.
> >> Lots of things are breaking changes in a major release. Or: deprecate
> >> in Spark 2.4.1, if desired?
> >>
> >> On Tue, Nov 6, 2018 at 7:36 PM Wenchen Fan  wrote:
> >> >
> >> > We made Scala 2.11 the default one in Spark 2.0, then dropped Scala 2.10 in 
> >> > Spark 2.3. Shall we follow that and drop Scala 2.11 at some point in Spark 
> >> > 3.x?
> >> >
> >> > On Wed, Nov 7, 2018 at 8:55 AM Reynold Xin  wrote:
> >> >>
> >> >> Have we deprecated Scala 2.11 already in an existing release?
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread DB Tsai
Supporting only Scala 2.12 in Spark 3 would be ideal.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Nov 6, 2018, at 2:55 PM, Felix Cheung  wrote:
> 
> So to clarify, only scala 2.12 is supported in Spark 3?
> 
>  
> From: Ryan Blue 
> Sent: Tuesday, November 6, 2018 1:24 PM
> To: d_t...@apple.com
> Cc: Sean Owen; Spark Dev List; cdelg...@apple.com
> Subject: Re: Make Scala 2.12 as default Scala version in Spark 3.0
>  
> +1 to Scala 2.12 as the default in Spark 3.0.
> 
> On Tue, Nov 6, 2018 at 11:50 AM DB Tsai  wrote:
> +1 on dropping Scala 2.11 in Spark 3.0 to simplify the build. 
> 
> As Scala 2.11 will not support Java 11 unless we make a significant 
> investment, if we decide not to drop Scala 2.11 in Spark 3.0, what we can do 
> is have only the Scala 2.12 build support Java 11 while the Scala 2.11 build supports Java 
> 8. But I agree with Sean that this can make the dependencies really complicated; 
> hence I support dropping Scala 2.11 in Spark 3.0 directly.
> 
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
> Inc
> 
>> On Nov 6, 2018, at 11:38 AM, Sean Owen  wrote:
>> 
>> I think we should make Scala 2.12 the default in Spark 3.0. I would
>> also prefer to drop Scala 2.11 support in 3.0. In theory, not dropping
>> 2.11 support means we'd support Scala 2.11 for years, the lifetime
>> of Spark 3.x. In practice, we could drop 2.11 support in a 3.1.0 or
>> 3.2.0 release, kind of like what happened with 2.10 in 2.x.
>> 
>> Java (9-)11 support also complicates this. I think getting it to work
>> will need some significant dependency updates, and I worry not all
>> will be available for 2.11 or will present some knotty problems. We'll
>> find out soon if that forces the issue.
>> 
>> Also note that Scala 2.13 is pretty close to release, and we'll want
>> to support it soon after release, perhaps sooner than the long delay
>> before 2.12 was supported (because it was hard!). It will probably be
>> out well before Spark 3.0. Cross-compiling for 3 Scala versions sounds
>> like too much. 3.0 could support 2.11 and 2.12, and 3.1 support 2.12
>> and 2.13, or something. But if 2.13 support is otherwise attainable at
>> the release of Spark 3.0, I wonder if that too argues for dropping
>> 2.11 support.
>> 
>> Finally I'll say that Spark itself isn't dropping 2.11 support for a
>> while, no matter what; it still exists in the 2.4.x branch of course.
>> People who can't update off Scala 2.11 can stay on Spark 2.x, note.
>> 
>> Sean
>> 
>> 
>> On Tue, Nov 6, 2018 at 1:13 PM DB Tsai  wrote:
>>> 
>>> We made Scala 2.11 the default Scala version in Spark 2.0. Now, the next 
>>> Spark version will be 3.0, so it's a great time to discuss whether we should make 
>>> Scala 2.12 the default Scala version in Spark 3.0.
>>> 
>>> Scala 2.11 is EOL, and it came out 4.5 years ago; as a result, it's unlikely to 
>>> support JDK 11 in Scala 2.11 unless we're willing to sponsor the needed 
>>> work per the discussion in the Scala community, 
>>> https://github.com/scala/scala-dev/issues/559#issuecomment-436160166
>>> 
>>> We have initial support for Scala 2.12 in Spark 2.4. If we decide to make 
>>> Scala 2.12 the default for Spark 3.0 now, we will have ample time to work on 
>>> bugs and issues that we may run into.
>>> 
>>> What do you think?
>>> 
>>> Thanks,
>>> 
>>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
>>> Inc
>>> 
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread DB Tsai
+1 on dropping Scala 2.11 in Spark 3.0 to simplify the build. 

As Scala 2.11 will not support Java 11 unless we make a significant investment, 
if we decide not to drop Scala 2.11 in Spark 3.0, what we can do is have only 
the Scala 2.12 build support Java 11 while the Scala 2.11 build supports Java 8. But I agree 
with Sean that this can make the dependencies really complicated; hence I support 
dropping Scala 2.11 in Spark 3.0 directly.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Nov 6, 2018, at 11:38 AM, Sean Owen  wrote:
> 
> I think we should make Scala 2.12 the default in Spark 3.0. I would
> also prefer to drop Scala 2.11 support in 3.0. In theory, not dropping
> 2.11 support means we'd support Scala 2.11 for years, the lifetime
> of Spark 3.x. In practice, we could drop 2.11 support in a 3.1.0 or
> 3.2.0 release, kind of like what happened with 2.10 in 2.x.
> 
> Java (9-)11 support also complicates this. I think getting it to work
> will need some significant dependency updates, and I worry not all
> will be available for 2.11 or will present some knotty problems. We'll
> find out soon if that forces the issue.
> 
> Also note that Scala 2.13 is pretty close to release, and we'll want
> to support it soon after release, perhaps sooner than the long delay
> before 2.12 was supported (because it was hard!). It will probably be
> out well before Spark 3.0. Cross-compiling for 3 Scala versions sounds
> like too much. 3.0 could support 2.11 and 2.12, and 3.1 support 2.12
> and 2.13, or something. But if 2.13 support is otherwise attainable at
> the release of Spark 3.0, I wonder if that too argues for dropping
> 2.11 support.
> 
> Finally I'll say that Spark itself isn't dropping 2.11 support for a
> while, no matter what; it still exists in the 2.4.x branch of course.
> People who can't update off Scala 2.11 can stay on Spark 2.x, note.
> 
> Sean
> 
> 
> On Tue, Nov 6, 2018 at 1:13 PM DB Tsai  wrote:
>> 
>> We made Scala 2.11 the default Scala version in Spark 2.0. Now, the next 
>> Spark version will be 3.0, so it's a great time to discuss whether we should make 
>> Scala 2.12 the default Scala version in Spark 3.0.
>> 
>> Scala 2.11 is EOL, and it came out 4.5 years ago; as a result, it's unlikely to 
>> support JDK 11 in Scala 2.11 unless we're willing to sponsor the needed work 
>> per the discussion in the Scala community, 
>> https://github.com/scala/scala-dev/issues/559#issuecomment-436160166
>> 
>> We have initial support for Scala 2.12 in Spark 2.4. If we decide to make 
>> Scala 2.12 the default for Spark 3.0 now, we will have ample time to work on 
>> bugs and issues that we may run into.
>> 
>> What do you think?
>> 
>> Thanks,
>> 
>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
>> Inc
>> 
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 



Re: Test and support only LTS JDK release?

2018-11-06 Thread DB Tsai
OpenJDK will follow Oracle's release cycle, 
https://openjdk.java.net/projects/jdk/, a strict six-month model. I'm not 
familiar with other non-Oracle VMs and Red Hat support.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Nov 6, 2018, at 11:26 AM, Reynold Xin  wrote:
> 
> What do OpenJDK and other non-Oracle VMs do? I know there was a lot of 
> discussion from Red Hat etc. about support.
> 
> 
> On Tue, Nov 6, 2018 at 11:24 AM DB Tsai  <mailto:d_t...@apple.com>> wrote:
> Given Oracle's new 6-month release model, I feel the only realistic option is 
> to test and support only LTS JDKs, such as JDK 11 and future LTS releases. I 
> would like to have a discussion on this in the Spark community.
> 
> Thanks,
> 
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
> Inc
> 



Test and support only LTS JDK release?

2018-11-06 Thread DB Tsai
Given Oracle's new 6-month release model, I feel the only realistic option is 
to test and support only LTS JDKs, such as JDK 11 and future LTS releases. I would 
like to have a discussion on this in the Spark community.

Thanks,

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc



Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread DB Tsai
We made Scala 2.11 the default Scala version in Spark 2.0. Now, the next Spark 
version will be 3.0, so it's a great time to discuss whether we should make Scala 2.12 
the default Scala version in Spark 3.0.

Scala 2.11 is EOL, and it came out 4.5 years ago; as a result, it's unlikely to 
support JDK 11 in Scala 2.11 unless we're willing to sponsor the needed work 
per the discussion in the Scala community, 
https://github.com/scala/scala-dev/issues/559#issuecomment-436160166

We have initial support for Scala 2.12 in Spark 2.4. If we decide to make Scala 
2.12 the default for Spark 3.0 now, we will have ample time to work on bugs and 
issues that we may run into.

What do you think?

Thanks, 

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Java 11 support

2018-11-06 Thread DB Tsai
Scala 2.11 is EOL, and only Scala 2.12 will support JDK 11 
(https://github.com/scala/scala-dev/issues/559#issuecomment-436160166), so we 
might need to make Scala 2.12 the default version in Spark 3.0 to move forward. 

Given Oracle's new 6-month release model, I think the only realistic option is 
to support and test only LTS JDKs. I'll send out two separate emails to dev to 
facilitate the discussion.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Nov 6, 2018, at 9:47 AM, shane knapp  wrote:
> 
> cool, i was wondering when we were going to forge ahead into the great 
> future of jdk8++...  i went ahead and created a sub-task of installing a 
> newer version of java on the build nodes 
> (https://issues.apache.org/jira/browse/SPARK-25953), and once we figure out 
> exactly what version we want i'll go ahead and get that done.
> 
> On Tue, Nov 6, 2018 at 9:11 AM Sean Owen  <mailto:sro...@gmail.com>> wrote:
> I think that Java 9 support basically gets Java 10, 11 support. But
> the jump from 8 to 9 is unfortunately more breaking than usual because
> of the total revamping of the internal JDK classes. I think it will be
> mostly a matter of dependencies needing updates to work. I agree this
> is probably pretty important for Spark 3. Here's the ticket I know of:
> https://issues.apache.org/jira/browse/SPARK-24417 . DB is already
> working on some of it, I see.
> On Tue, Nov 6, 2018 at 10:59 AM Felix Cheung  <mailto:felixcheun...@hotmail.com>> wrote:
> >
> > Speaking of, get we work to support Java 11?
> > That will fix all the problems below.
> >
> >
> >
> > 
> > From: Felix Cheung  > <mailto:felixcheun...@hotmail.com>>
> > Sent: Tuesday, November 6, 2018 8:57 AM
> > To: Wenchen Fan
> > Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >
> > We have not been able to publish to CRAN for quite some time (since 2.3.0 
> > was archived - the cause is Java 11)
> >
> > I think it’s ok to announce the release of 2.4.0
> >
> >
> > 
> > From: Wenchen Fan mailto:cloud0...@gmail.com>>
> > Sent: Tuesday, November 6, 2018 8:51 AM
> > To: Felix Cheung
> > Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >
> > Do you mean we should have a 2.4.0 release without CRAN and then do a 2.4.1 
> > immediately?
> >
> > On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung  > <mailto:felixcheun...@hotmail.com>> wrote:
> >>
> >> Shivaram and I were discussing.
> >> Actually we worked with them before. Another possible approach is to 
> >> remove the vignettes eval and all test from the source package... in the 
> >> next release.
> >>
> >>
> >> 
> >> From: Matei Zaharia  >> <mailto:matei.zaha...@gmail.com>>
> >> Sent: Tuesday, November 6, 2018 12:07 AM
> >> To: Felix Cheung
> >> Cc: Sean Owen; dev; Shivaram Venkataraman
> >> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >>
> >> Maybe it’s wroth contacting the CRAN maintainers to ask for help? Perhaps 
> >> we aren’t disabling it correctly, or perhaps they can ignore this specific 
> >> failure. +Shivaram who might have some ideas.
> >>
> >> Matei
> >>
> >> > On Nov 5, 2018, at 9:09 PM, Felix Cheung  >> > <mailto:felixcheun...@hotmail.com>> wrote:
> >> >
> >> > I don't know what the cause is yet.
> >> >
> >> > The test should be skipped because of this check
> >> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21
> >> >
> >> > And this
> >> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57
> >> >
> >> > But it ran:
> >> > callJStatic("org.apache.spark.ml.r

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-10-29 Thread DB Tsai
+0

I understand that schema pruning is an experimental feature in Spark
2.4, and it can help a lot with read performance as people are trying
to keep hierarchical data in a nested format.

We just found a serious bug: it can fail the Parquet reader if a nested
field and a top-level field are selected simultaneously.
https://issues.apache.org/jira/browse/SPARK-25879

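The JIRA has the full reproduction; as a rough, hypothetical sketch of the
query shape that triggers it (column names are made up, and this assumes the
2.4 schema-pruning flag spark.sql.optimizer.nestedSchemaPruning.enabled is on),
selecting a top-level column together with a nested field is enough:

    // Hypothetical illustration of the reported failure pattern: a
    // top-level column plus a nested struct field in the same select.
    import spark.implicits._
    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
    val contacts = spark.read.parquet("/tmp/contacts")  // id, name: struct<first,last>
    contacts.select($"id", $"name.first").show()
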
If we decide not to fix it in 2.4, we should at least document it in
the release notes to let users know.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0
On Mon, Oct 29, 2018 at 8:42 PM Hyukjin Kwon  wrote:
>
> +1
>
> 2018년 10월 30일 (화) 오전 11:03, Gengliang Wang 님이 작성:
>>
>> +1
>>
>> > 在 2018年10月30日,上午10:41,Sean Owen  写道:
>> >
>> > +1
>> >
>> > Same result as in RC4 from me, and the issues I know of that were
>> > raised with RC4 are resolved. I tested vs Scala 2.12 and 2.11.
>> >
>> > These items are still targeted to 2.4.0; Xiangrui I assume these
>> > should just be untargeted now, or resolved?
>> > SPARK-25584 Document libsvm data source in doc site
>> > SPARK-25346 Document Spark builtin data sources
>> > SPARK-24464 Unit tests for MLlib's Instrumentation
>> > On Mon, Oct 29, 2018 at 5:22 AM Wenchen Fan  wrote:
>> >>
>> >> Please vote on releasing the following candidate as Apache Spark version 
>> >> 2.4.0.
>> >>
>> >> The vote is open until November 1 PST and passes if a majority +1 PMC 
>> >> votes are cast, with
>> >> a minimum of 3 +1 votes.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 2.4.0
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>> >>
>> >> The tag to be voted on is v2.4.0-rc5 (commit 
>> >> 0a4c03f7d084f1d2aa48673b99f3b9496893ce8d):
>> >> https://github.com/apache/spark/tree/v2.4.0-rc5
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-bin/
>> >>
>> >> Signatures used for Spark RCs can be found in this file:
>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>
>> >> The staging repository for this release can be found at:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1291
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-docs/
>> >>
>> >> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> >> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>> >>
>> >> FAQ
>> >>
>> >> =
>> >> How can I help test this release?
>> >> =
>> >>
>> >> If you are a Spark user, you can help us test this release by taking
>> >> an existing Spark workload and running on this release candidate, then
>> >> reporting any regressions.
>> >>
>> >> If you're working in PySpark you can set up a virtual env and install
>> >> the current RC and see if anything important breaks. In the Java/Scala
>> >> world, you can add the staging repository to your project's resolvers and test
>> >> with the RC (make sure to clean up the artifact cache before/after so
>> >> you don't end up building with an out-of-date RC going forward).
>> >>
>> >> ===
>> >> What should happen to JIRA tickets still targeting 2.4.0?
>> >> ===
>> >>
>> >> The current list of open tickets targeted at 2.4.0 can be found at:
>> >> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> >> Version/s" = 2.4.0
>> >>
>> >> Committers should look at those and triage. Extremely important bug
>> >> fixes, documentation, and API tweaks that impact compatibility should
>> >> be worked on immediately. Everything else please retarget to an
>> >> appropriate release.
>> >>
>> >> ==
>> >> But my bug isn't fixed?
>> >> ==
>> >>
>> >> In order to make timely releases, we will typically not hold the
>> >> release unless the bug in question is a regression from the previous
>> >> release. That being said, if there is something which is a regression
>> >> that has not been correctly targeted please ping me or a committer to
>> >> help target the issue.
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Starting to make changes for Spark 3 -- what can we delete?

2018-10-17 Thread DB Tsai
I'll +1 on removing the legacy mllib code. Many users are confused about the 
APIs, and some of them have weird behaviors (for example, in gradient descent, 
the intercept is regularized, which it is not supposed to be). 

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Oct 17, 2018, at 7:42 AM, Erik Erlandson  wrote:
> 
> My understanding was that the legacy mllib api was frozen, with all new dev 
> going to ML, but it was not going to be removed. Although removing it would 
> get rid of a lot of `OldXxx` shims.
> 
> On Wed, Oct 17, 2018 at 12:55 AM Marco Gaido  wrote:
> Hi all,
> 
> I think a very big topic on this would be: what do we want to do with the old 
> mllib API? For long I have been told that it was going to be removed on 3.0. 
> Is this still the plan?
> 
> Thanks,
> Marco
> 
> Il giorno mer 17 ott 2018 alle ore 03:11 Marcelo Vanzin 
>  ha scritto:
> Might be good to take a look at things marked "@DeveloperApi" and
> whether they should stay that way.
> 
> e.g. I was looking at SparkHadoopUtil and I've always wanted to just
> make it private to Spark. I don't see why apps would need any of those
> methods.
> On Tue, Oct 16, 2018 at 10:18 AM Sean Owen  wrote:
>> 
>> There was already agreement to delete deprecated things like Flume and
>> Kafka 0.8 support in master. I've got several more on my radar, and
>> wanted to highlight them and solicit general opinions on where we
>> should accept breaking changes.
>> 
>> For example how about removing accumulator v1?
>> https://github.com/apache/spark/pull/22730
>> 
>> Or using the standard Java Optional?
>> https://github.com/apache/spark/pull/22383
>> 
>> Or cleaning up some old workarounds and APIs while at it?
>> https://github.com/apache/spark/pull/22729 (still in progress)
>> 
>> I think I talked myself out of replacing Java function interfaces with
>> java.util.function because...
>> https://issues.apache.org/jira/browse/SPARK-25369
>> 
>> There are also, say, old json and csv and avro reading method
>> deprecated since 1.4. Remove?
>> Anything deprecated since 2.0.0?
>> 
>> Interested in general thoughts on these.
>> 
>> Here are some more items targeted to 3.0:
>> https://issues.apache.org/jira/browse/SPARK-17875?jql=project%3D%22SPARK%22%20AND%20%22Target%20Version%2Fs%22%3D%223.0.0%22%20ORDER%20BY%20priority%20ASC
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
> 
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Scala 2.12 support

2018-06-07 Thread DB Tsai
It is from the most recent 2.11.

I haven't tried it yet on 2.12, but I expect to get the same result.

On Thu, Jun 7, 2018 at 6:28 PM Wenchen Fan  wrote:

> One more point: there was a time when we maintained 2 Spark REPL codebases
> for Scala 2.10 and 2.11; maybe we can do the same for Scala 2.11 and 2.12
> if it's too hard to find a common way to do that between different Scala
> versions.
>
> On Thu, Jun 7, 2018 at 6:20 PM, Marcelo Vanzin 
> wrote:
>
>> But DB's shell output is on the most recent 2.11, not 2.12, right?
>>
>> On Thu, Jun 7, 2018 at 5:54 PM, Holden Karau 
>> wrote:
>> > I agree that's a little odd; could we not add the backspace terminal
>> > character? Regardless, even if not, I don't think that should be a
>> blocker
>> > for 2.12 support especially since it doesn't degrade the 2.11
>> experience.
>> >
>> > On Thu, Jun 7, 2018, 5:53 PM DB Tsai  wrote:
>> >>
>> >> If we decide to initialize Spark in `initializeSynchronous()` in Scala
>> >> 2.11.12, it will look like the following which is odd.
>> >>
>> >> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java
>> >> 1.8.0_161)
>> >> Type in expressions to have them evaluated.
>> >> Type :help for more information.
>> >>
>> >> scala> Spark context Web UI available at http://192.168.1.169:4040
>> >> Spark context available as 'sc' (master = local[*], app id =
>> >> local-1528180279528).
>> >> Spark session available as 'spark’.
>> >> scala>
>> >>
>> >> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
>> >> Apple, Inc
>> >>
>> >> On Jun 7, 2018, at 5:49 PM, Holden Karau  wrote:
>> >>
>> >> Tests can just be changed to accept either output too :p
>> >>
>> >> On Thu, Jun 7, 2018, 5:19 PM Dean Wampler 
>> wrote:
>> >>>
>> >>> Do the tests expect a particular console output order? That would
>> annoy
>> >>> them. ;) You could sort the expected and output lines, then diff...
>> >>>
>> >>> Dean Wampler, Ph.D.
>> >>> VP, Fast Data Engineering at Lightbend
>> >>> Author: Programming Scala, 2nd Edition, Fast Data Architectures for
>> >>> Streaming Applications, and other content from O'Reilly
>> >>> @deanwampler
>> >>> http://polyglotprogramming.com
>> >>> https://github.com/deanwampler
>> >>>
>> >>> On Thu, Jun 7, 2018 at 5:09 PM, Holden Karau 
>> >>> wrote:
>> >>>>
>> >>>> If the difference is the order of the welcome message I think that
>> >>>> should be fine.
>> >>>>
>> >>>> On Thu, Jun 7, 2018, 4:43 PM Dean Wampler 
>> wrote:
>> >>>>>
>> >>>>> I'll point the Scala team to this issue, but it's unlikely to get
>> fixed
>> >>>>> any time soon.
>> >>>>>
>> >>>>> dean
>> >>>>>
>> >>>>> Dean Wampler, Ph.D.
>> >>>>> VP, Fast Data Engineering at Lightbend
>> >>>>> Author: Programming Scala, 2nd Edition, Fast Data Architectures for
>> >>>>> Streaming Applications, and other content from O'Reilly
>> >>>>> @deanwampler
>> >>>>> http://polyglotprogramming.com
>> >>>>> https://github.com/deanwampler
>> >>>>>
>> >>>>> On Thu, Jun 7, 2018 at 4:27 PM, DB Tsai  wrote:
>> >>>>>>
>> >>>>>> Thanks Felix for bringing this up.
>> >>>>>>
>> >>>>>> Currently, in Scala 2.11.8, we initialize the Spark by overriding
>> >>>>>> loadFIles() before REPL sees any file since there is no good hook
>> in Scala
>> >>>>>> to load our initialization code.
>> >>>>>>
>> >>>>>> In Scala 2.11.12 and newer version of the Scala 2.12.x, loadFIles()
>> >>>>>> method was removed.
>> >>>>>>
>> >>>>>> Alternatively, one way we can do in the newer version of Scala is
>> by
>> >>>>>> overriding initializeSynchronous() suggested by Som Snytt; I have
>> a working
>> >>>>

Re: Scala 2.12 support

2018-06-07 Thread DB Tsai
If we decide to initialize Spark in `initializeSynchronous()` in Scala 2.11.12, 
it will look like the following, which is odd.

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.

scala> Spark context Web UI available at http://192.168.1.169:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1528180279528).
Spark session available as 'spark’.
scala>

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Jun 7, 2018, at 5:49 PM, Holden Karau  wrote:
> 
> Tests can just be changed to accept either output too :p
> 
> On Thu, Jun 7, 2018, 5:19 PM Dean Wampler  <mailto:deanwamp...@gmail.com>> wrote:
> Do the tests expect a particular console output order? That would annoy them. 
> ;) You could sort the expected and output lines, then diff...
> 
> Dean Wampler, Ph.D.
> VP, Fast Data Engineering at Lightbend
> Author: Programming Scala, 2nd Edition 
> <http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures 
> for Streaming Applications 
> <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
>  and other content from O'Reilly
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com <http://polyglotprogramming.com/>
> https://github.com/deanwampler <https://github.com/deanwampler>
> 
> On Thu, Jun 7, 2018 at 5:09 PM, Holden Karau  <mailto:hol...@pigscanfly.ca>> wrote:
> If the difference is the order of the welcome message I think that should be 
> fine.
> 
> On Thu, Jun 7, 2018, 4:43 PM Dean Wampler  <mailto:deanwamp...@gmail.com>> wrote:
> I'll point the Scala team to this issue, but it's unlikely to get fixed any 
> time soon.
> 
> dean
> 
> Dean Wampler, Ph.D.
> VP, Fast Data Engineering at Lightbend
> Author: Programming Scala, 2nd Edition 
> <http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures 
> for Streaming Applications 
> <http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
>  and other content from O'Reilly
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com <http://polyglotprogramming.com/>
> https://github.com/deanwampler <https://github.com/deanwampler>
> 
> On Thu, Jun 7, 2018 at 4:27 PM, DB Tsai  <mailto:d_t...@apple.com>> wrote:
> Thanks Felix for bringing this up.
> 
> Currently, in Scala 2.11.8, we initialize Spark by overriding loadFiles() 
> before the REPL sees any file, since there is no good hook in Scala to load our 
> initialization code.
> 
> In Scala 2.11.12 and newer versions of Scala 2.12.x, the loadFiles() method 
> was removed.
> 
> Alternatively, one thing we can do in the newer versions of Scala is 
> override initializeSynchronous(), as suggested by Som Snytt; I have a working 
> PR with this approach,
> https://github.com/apache/spark/pull/21495 , and this approach should work 
> for older versions of Scala too. 
> 
> However, in the newer versions of Scala, the first thing the REPL calls 
> is printWelcome, so with this approach the welcome message will be 
> shown first and then the URL of the Spark UI. This will cause UI 
> inconsistencies between different versions of Scala.
> 
> We can also initialize Spark in printWelcome, which I feel is more hacky. 
> It will only work for newer versions of Scala since, in older versions of Scala, 
> printWelcome is called at the end of the initialization process. If we decide 
> to go this route, users basically cannot use Scala older than 2.11.9.
> 
> I think this is also a blocker for us to move to a newer version of Scala 
> 2.12.x since the newer versions of Scala 2.12.x have the same issue.
> 
> In my opinion, Scala should fix the root cause and provide a stable hook for 
> 3rd party developers to initialize their custom code.
> 
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
> Inc
> 
> > On Jun 7, 2018, at 6:43 AM, Felix Cheung  > <mailto:felixcheun...@hotmail.com>> wrote:
> > 
> > +1
> > 
> > Spoke to Dean as well and mentioned the problem with 2.11.12 
> > https://github.com/scala/bug/issues/10913 
> > <https://github.com/scala/bug/issues/10913>
> > 
> > _
> > From: Sean Owen mailto:sro...@gmail.com>>
> > Sent: Wednesday, June 6, 2018 12:23 PM
> > Subject: Re: Scala 2.12 support
> > To: Holden Karau mailto:hol..

Re: Scala 2.12 support

2018-06-07 Thread DB Tsai
Thanks Felix for bringing this up.

Currently, in Scala 2.11.8, we initialize Spark by overriding loadFiles() 
before the REPL sees any file, since there is no good hook in Scala to load our 
initialization code.

In Scala 2.11.12 and newer versions of Scala 2.12.x, the loadFiles() method was 
removed.

Alternatively, one thing we can do in the newer versions of Scala is override 
initializeSynchronous(), as suggested by Som Snytt; I have a working PR with this 
approach,
https://github.com/apache/spark/pull/21495 , and this approach should work for 
older versions of Scala too. 

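For readers who have not looked at the REPL integration, the "initialization
code" in question boils down to evaluating a couple of lines in the freshly
created interpreter before the user sees the prompt - roughly as in this
simplified, hypothetical sketch (not the actual Spark source; the open question
in this thread is only which hook runs it):

    // Simplified sketch: make `spark` and `sc` available in the new REPL
    // session. Whether this runs from loadFiles() (Scala 2.11.8) or from a
    // later hook such as initializeSynchronous() is the point under discussion.
    def initializeSpark(intp: scala.tools.nsc.interpreter.IMain): Unit = {
      intp.beQuietDuring {
        intp.interpret("""
          @transient val spark = org.apache.spark.repl.Main.createSparkSession()
          @transient val sc = spark.sparkContext
          println("Spark context available as 'sc'.")
        """)
      }
    }
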
However, in the newer versions of Scala, the first thing the REPL calls is 
printWelcome, so with this approach the welcome message will be shown first 
and then the URL of the Spark UI. This will cause UI 
inconsistencies between different versions of Scala.

We can also initialize Spark in printWelcome, which I feel is more hacky. 
It will only work for newer versions of Scala since, in older versions of Scala, 
printWelcome is called at the end of the initialization process. If we decide 
to go this route, users basically cannot use Scala older than 2.11.9.

I think this is also a blocker for us to move to a newer version of Scala 2.12.x 
since the newer versions of Scala 2.12.x have the same issue.

In my opinion, Scala should fix the root cause and provide a stable hook for 
3rd party developers to initialize their custom code.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Jun 7, 2018, at 6:43 AM, Felix Cheung  wrote:
> 
> +1
> 
> Spoke to Dean as well and mentioned the problem with 2.11.12 
> https://github.com/scala/bug/issues/10913
> 
> _
> From: Sean Owen 
> Sent: Wednesday, June 6, 2018 12:23 PM
> Subject: Re: Scala 2.12 support
> To: Holden Karau 
> Cc: Dean Wampler , Reynold Xin , 
> dev 
> 
> 
> If it means no change to 2.11 support, seems OK to me for Spark 2.4.0. The 
> 2.12 support is separate and has never been mutually compatible with 2.11 
> builds anyway. (I also hope, suspect that the changes are minimal; tests are 
> already almost entirely passing with no change to the closure cleaner when 
> built for 2.12)
> 
> On Wed, Jun 6, 2018 at 1:33 PM Holden Karau  wrote:
> Just chatted with Dean @ the summit and it sounds like from Adriaan there is 
> a fix in 2.13 for the API change issue that could be back ported to 2.12 so 
> how about we try and get this ball rolling?
> 
> It sounds like it would also need a closure cleaner change, which could be 
> backwards compatible but since it’s such a core component and we might want 
> to be cautious with it, we could when building for 2.11 use the old cleaner 
> code and for 2.12 use the new code so we don’t break anyone.
> 
> How do folks feel about this?
> 
> 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [MLLib] Logistic Regression and standadization

2018-04-24 Thread DB Tsai
As I’m one of the original authors, let me chime in with some comments. 

Without standardization, LBFGS will be unstable. For example, if a 
feature is multiplied by 10, then the corresponding coefficient should be divided by 10 to 
make the same prediction. But without standardization, LBFGS will converge 
to a different solution due to numerical stability issues.

TL;DR, this can be implemented in the optimizer or in the trainer. We chose to 
implement it in the trainer, as the LBFGS optimizer in Breeze suffers from this issue. As a 
user, you don’t need to care much even if you have one-hot encoded features, and 
the result should match R. 

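For anyone mapping this onto the DataFrame API, a minimal sketch (assuming a
`training` DataFrame with the usual "features" and "label" columns; the values
are made up): the switch lives on the estimator, it only changes how the
problem is fed to the optimizer, and fitted coefficients always come back on
the original feature scale:

    import org.apache.spark.ml.classification.LogisticRegression

    // Sketch only: with regParam > 0 the two settings can give different
    // solutions (see the quoted replies below); coefficients are reported
    // on the original scale either way.
    val lr = new LogisticRegression()
      .setRegParam(0.1)
      .setStandardization(false)   // defaults to true
    val model = lr.fit(training)
    println(model.coefficients)
    println(model.intercept)
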
DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Apr 20, 2018, at 5:56 PM, Weichen Xu  wrote:
> 
> Right. If the regularization term isn't zero, then enabling/disabling standardization 
> will give different results.
> But, if comparing results between R-glmnet and mllib, if we set the same 
> parameters for regularization/standardization/... , then we should get the 
> same result. If not, then maybe there's a bug. In this case you can paste 
> your testing code and I can help fix it.
> 
> On Sat, Apr 21, 2018 at 1:06 AM, Valeriy Avanesov  <mailto:acop...@gmail.com>> wrote:
> Hi all.
> 
> Filipp, do you use l1/l2/elstic-net penalization? I believe in this case 
> standardization matters.
> 
> Best,
> 
> Valeriy.
> 
> 
> On 04/17/2018 11:40 AM, Weichen Xu wrote:
>> Not a bug.
>> 
>> When disabling standardization, mllib LR will still do standardization for 
>> features, but it will scale the coefficients back at the end (after training 
>> finishes). So it will get the same result as training without standardization. 
>> The purpose of this is to improve the rate of convergence. So the result 
>> should always be exactly the same as R's glmnet, whether standardization is 
>> enabled or disabled. 
>> 
>> Thanks!
>> 
>> On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang > <mailto:yblia...@gmail.com>> wrote:
>> Hi Filipp,
>> 
>> MLlib’s LR implementation handles standardization the same way as R’s glmnet 
>> does. 
>> Actually you don’t need to care about the implementation detail, as the 
>> coefficients are always returned on the original scale, so it should 
>> return the same result as other popular ML libraries.
>> Could you point me to where glmnet doesn’t scale features? 
>> I suspect other issues caused your prediction quality to drop. If you can 
>> share the code and data, I can help check it.
>> 
>> Thanks
>> Yanbo
>> 
>> 
>>> On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin >> <mailto:filipp.zhin...@gmail.com>> wrote:
>>> 
>>> Hi all,
>>> 
>>> While migrating from a custom LR implementation to MLLib's LR implementation, 
>>> my colleagues noticed that prediction quality dropped (according to 
>>> different business metrics).
>>> It turned out that this issue is caused by the feature standardization performed 
>>> by MLLib's LR: regardless of the 'standardization' option's value, all features 
>>> are scaled during loss and gradient computation (as well as in a few other 
>>> places): 
>>> https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
>>> 
>>> According to comments in the code, standardization should be implemented 
>>> the same way it was implemented in R's glmnet package. I've looked through the 
>>> corresponding Fortran code, and it seems like glmnet doesn't scale features 
>>> when you're disabling standardisation (but MLLib still does).
>>> 
>>> Our models contain multiple one-hot encoded features, and scaling them is a 
>>> pretty bad idea.
>>> 
>>> Why does MLLib's LR always scale all features? From my POV it's a bug.
>>> 
>>> Thanks in advance,
>>> Filipp.
>>> 
>> 
>> 
> 
> 



Re: Will higher order functions in spark SQL be pushed upstream?

2017-10-10 Thread DB Tsai
Hello,

On Netflix's algorithm team, we work on ranking problems a lot, where
we naturally deal with datasets containing nested lists of structs. We
built Scala APIs like map, filter, drop, and withColumn that can work on
nested lists of structs efficiently using SQL expressions with
codegen.

Here is what we propose for how the APIs will look, and we would like
to socialize it with the community to get more feedback!

https://issues.apache.org/jira/browse/SPARK-22231

It would be cool to share some building blocks with Databricks's
higher-order function feature.

Thanks.

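A rough illustration of the kind of manipulation being discussed (this is not
the API proposed in SPARK-22231 - see the JIRA for that - it uses the SQL
higher-order functions that later shipped in Spark 2.4 to work on an
array<struct> column without exploding it; the data is made up):

    import org.apache.spark.sql.functions.expr

    // Made-up ranking-style data: an array of (item, score) structs per row.
    val df = spark.range(1).selectExpr(
      "array(named_struct('item', 'a', 'score', 1.0)," +
      " named_struct('item', 'b', 'score', 3.0)) AS recs")

    // Keep high-scoring entries and bump each score, in place.
    df.select(expr(
      "transform(filter(recs, r -> r.score > 2.0)," +
      " r -> named_struct('item', r.item, 'score', r.score + 1.0)) AS recs"))
      .show(false)
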
On Fri, Jun 9, 2017 at 5:04 PM, Antoine HOM  wrote:
> Good news :) Thx Sameer.
>
>
> On Friday, June 9, 2017, Sameer Agarwal  wrote:
>>>
>>> * As a heavy user of complex data types I was wondering if there was
>>> any plan to push those changes upstream?
>>
>>
>> Yes, we intend to contribute this to open source.
>>
>>>
>>> * In addition, I was wondering if as part of this change it also tries
>>> to solve the column pruning / filter pushdown issues with complex
>>> datatypes?
>>
>>
>> For parquet, this effort is primarily tracked via SPARK-4502 (see
>> https://github.com/apache/spark/pull/16578) and is currently targeted for
>> 2.3.

-- 
Sincerely,

DB Tsai
--
PGP Key ID: 0x5CED8B896A6BDFA0

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming Tejas Patil as a Spark committer

2017-10-06 Thread DB Tsai
Congratulations!

On Wed, Oct 4, 2017 at 6:55 PM, Liwei Lin  wrote:
> Congratulations!
>
> Cheers,
> Liwei
>
> On Wed, Oct 4, 2017 at 2:27 PM, Yuval Itzchakov  wrote:
>>
>> Congratulations and Good luck! :)
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread DB Tsai
+1

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0


On Fri, Oct 6, 2017 at 7:46 AM, Felix Cheung  wrote:
> Thanks Nick, Hyukjin. Yes this seems to be a longer standing issue on RHEL
> with respect to forking.
>
> 
> From: Nick Pentreath 
> Sent: Friday, October 6, 2017 6:16:53 AM
> To: Hyukjin Kwon
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.2 (RC4)
>
> Ah yes - I recall that it was fixed. Forgot it was for 2.3.0
>
> My +1 vote stands.
>
> On Fri, 6 Oct 2017 at 15:15 Hyukjin Kwon  wrote:
>>
>> Hi Nick,
>>
>> I believe that R test failure is due to SPARK-21093; at least the error
>> message looks the same, and that is fixed as of 2.3.0. This was not
>> backported because the reviewers and I were worried, as that fix touched a very core part of
>> SparkR (it was even reverted once after a very close look by some
>> reviewers).
>>
>> I asked Michael to note this as a known issue in
>> https://spark.apache.org/releases/spark-release-2-2-0.html#known-issues
>> before due to this reason.
>> I believe it should be fine, and we should probably note it if possible. I
>> believe this should not be a regression anyway as, if I understood
>> correctly, it was there from the very beginning.
>>
>> Thanks.
>>
>>
>>
>>
>> 2017-10-06 21:20 GMT+09:00 Nick Pentreath :
>>>
>>> Checked sigs & hashes.
>>>
>>> Tested on RHEL
>>> build/mvn -Phadoop-2.7 -Phive -Pyarn test passed
>>> Python tests passed
>>>
>>> I ran R tests and am getting some failures:
>>> https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem to
>>> recall similar issues on a previous release but I thought it was fixed).
>>>
>>> I re-ran R tests on an Ubuntu box to double check and they passed there.
>>>
>>> So I'd still +1 the release
>>>
>>> Perhaps someone can take a look at the R failures on RHEL just in case
>>> though.
>>>
>>>
>>> On Fri, 6 Oct 2017 at 05:58 vaquar khan  wrote:
>>>>
>>>> +1 (non binding ) tested on Ubuntu ,all test case  are passed.
>>>>
>>>> Regards,
>>>> Vaquar khan
>>>>
>>>> On Thu, Oct 5, 2017 at 10:46 PM, Hyukjin Kwon 
>>>> wrote:
>>>>>
>>>>> +1 too.
>>>>>
>>>>>
>>>>> On 6 Oct 2017 10:49 am, "Reynold Xin"  wrote:
>>>>>
>>>>> +1
>>>>>
>>>>>
>>>>> On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau 
>>>>> wrote:
>>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00 PST 
>>>>>> and
>>>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>>
>>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v2.1.2-rc4
>>>>>> (2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>>>>
>>>>>> List of JIRA tickets resolved in this release can be found with this
>>>>>> filter.
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>>>>
>>>>>> Release artifacts are signed with a key from:
>>>>>> https://people.apache.org/~holden/holdens_keys.asc
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>>>>
>>>>>>
>>>>>> FAQ
>>>>>>
>>>>>> How can I help test this release?
>>>>>>
>>>>>> If you are a Spark user, you can help us test this release b

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-10 Thread DB Tsai
I backported the fix into both branch-2.1 and branch-2.0. Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0


On Mon, Apr 10, 2017 at 4:20 PM, Ryan Blue  wrote:
> DB,
>
> This vote already failed and there isn't a RC3 vote yet. If you backport the
> changes to branch-2.1 they will make it into the next RC.
>
> rb
>
> On Mon, Apr 10, 2017 at 3:55 PM, DB Tsai  wrote:
>>
>> -1
>>
>> I think that back-porting SPARK-20270 and SPARK-18555 is very important,
>> since it's a critical bug where na.fill will mess up Long data even when
>> the data isn't null.
>>
>> Thanks.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0x5CED8B896A6BDFA0
>>
>> On Wed, Apr 5, 2017 at 11:12 AM, Holden Karau 
>> wrote:
>>>
>>> Following up, the issue with missing pypandoc/pandoc on the packaging
>>> machine has been resolved.
>>>
>>> On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau 
>>> wrote:
>>>>
>>>> See SPARK-20216; if Michael can let me know which machine is being used
>>>> for packaging, I can see if I can install pandoc on it (should be simple, but
>>>> I know the Jenkins cluster is a bit on the older side).
>>>>
>>>> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau 
>>>> wrote:
>>>>>
>>>>> So the fix is installing pandoc on whichever machine is used for
>>>>> packaging. I thought that was generally done on the machine of the person
>>>>> rolling the release, so I wasn't sure it made sense as a JIRA, but from
>>>>> chatting with Josh it sounds like that part might be one of the Jenkins
>>>>> workers - is there a fixed one that is used?
>>>>>
>>>>> Regardless I'll file a JIRA for this when I get back in front of my
>>>>> desktop (~1 hour or so).
>>>>>
>>>>> On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust
>>>>>  wrote:
>>>>>>
>>>>>> Thanks for the comments everyone.  This vote fails.  Here's how I
>>>>>> think we should proceed:
>>>>>>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>>>>>>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
>>>>>> report if this is a regression and if there is an easy fix that we should
>>>>>> wait for.
>>>>>>
>>>>>> For all the other test failures, please take the time to look through
>>>>>> JIRA and open an issue if one does not already exist so that we can 
>>>>>> triage
>>>>>> if these are just environmental issues.  If I don't hear any objections 
>>>>>> I'm
>>>>>> going to go ahead with RC3 tomorrow.
>>>>>>
>>>>>> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung
>>>>>>  wrote:
>>>>>>>
>>>>>>> -1
>>>>>>> sorry, found an issue with SparkR CRAN check.
>>>>>>> Opened SPARK-20197 and working on fix.
>>>>>>>
>>>>>>> 
>>>>>>> From: holden.ka...@gmail.com  on behalf of
>>>>>>> Holden Karau 
>>>>>>> Sent: Friday, March 31, 2017 6:25:20 PM
>>>>>>> To: Xiao Li
>>>>>>> Cc: Michael Armbrust; dev@spark.apache.org
>>>>>>> Subject: Re: [VOTE] Apache Spark 2.1.1 (RC2)
>>>>>>>
>>>>>>> -1 (non-binding)
>>>>>>>
>>>>>>> Python packaging doesn't seem to have quite worked out (looking at
>>>>>>> PKG-INFO the description is "Description: ! missing pandoc do not 
>>>>>>> upload
>>>>>>> to PyPI "), ideally it would be nice to have this as a version we
>>>>>>> upgrade to PyPi.
>>>>>>> Building this on my own machine results in a longer description.
>>>>>>>
>>>>>>> My guess is that whichever machine was used to package this is
>>>>>>> missing the pandoc executable (or possibly pypandoc library).
>>>>>>>
>>>>>>> On Fri, Mar 31, 2017 at

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-10 Thread DB Tsai
-1

I think that back-porting SPARK-20270
<https://github.com/apache/spark/pull/17577> and SPARK-18555
<https://github.com/apache/spark/pull/15994> is very important, since it's
a critical bug where na.fill will mess up Long data even when the data
isn't null.
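
For illustration, a minimal sketch of the symptom (the value and the local
SparkSession are placeholders, not taken from this thread): on an affected
build, na.fill routes non-null Long values through Double, which can silently
round large longs.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark  = SparkSession.builder().master("local[*]").appName("na-fill-check").getOrCreate()
val schema = StructType(Seq(StructField("v", LongType, nullable = true)))
val rows   = java.util.Arrays.asList(Row(1234567890123456789L), Row(null))
val df     = spark.createDataFrame(rows, schema)

// Only the null should change; with the bug, the large value can come back
// altered after the implicit Long -> Double -> Long round-trip.
df.na.fill(0).show(false)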

Thanks.


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0

On Wed, Apr 5, 2017 at 11:12 AM, Holden Karau  wrote:

> Following up, the issue with missing pypandoc/pandoc on the packaging
> machine has been resolved.
>
> On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau  wrote:
>
>> See SPARK-20216; if Michael can let me know which machine is being used
>> for packaging, I can see if I can install pandoc on it (should be simple, but
>> I know the Jenkins cluster is a bit on the older side).
>>
>> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau 
>> wrote:
>>
>>> So the fix is installing pandoc on whichever machine is used for
>>> packaging. I thought that was generally done on the machine of the person
>>> rolling the release, so I wasn't sure it made sense as a JIRA, but from
>>> chatting with Josh it sounds like that part might be one of the Jenkins
>>> workers - is there a fixed one that is used?
>>>
>>> Regardless I'll file a JIRA for this when I get back in front of my
>>> desktop (~1 hour or so).
>>>
>>> On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust 
>>> wrote:
>>>
>>>> Thanks for the comments everyone.  This vote fails.  Here's how I think
>>>> we should proceed:
>>>>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>>>>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
>>>> report if this is a regression and if there is an easy fix that we should
>>>> wait for.
>>>>
>>>> For all the other test failures, please take the time to look through
>>>> JIRA and open an issue if one does not already exist so that we can triage
>>>> if these are just environmental issues.  If I don't hear any objections I'm
>>>> going to go ahead with RC3 tomorrow.
>>>>
>>>> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung >>> > wrote:
>>>>
>>>> -1
>>>> sorry, found an issue with SparkR CRAN check.
>>>> Opened SPARK-20197 and working on fix.
>>>>
>>>> --
>>>> *From:* holden.ka...@gmail.com  on behalf of
>>>> Holden Karau 
>>>> *Sent:* Friday, March 31, 2017 6:25:20 PM
>>>> *To:* Xiao Li
>>>> *Cc:* Michael Armbrust; dev@spark.apache.org
>>>> *Subject:* Re: [VOTE] Apache Spark 2.1.1 (RC2)
>>>>
>>>> -1 (non-binding)
>>>>
>>>> Python packaging doesn't seem to have quite worked out (looking
>>>> at PKG-INFO the description is "Description: ! missing pandoc do not
>>>> upload to PyPI "), ideally it would be nice to have this as a version
>>>> we upgrade to PyPi.
>>>> Building this on my own machine results in a longer description.
>>>>
>>>> My guess is that whichever machine was used to package this is missing
>>>> the pandoc executable (or possibly pypandoc library).
>>>>
>>>> On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li  wrote:
>>>>
>>>> +1
>>>>
>>>> Xiao
>>>>
>>>> 2017-03-30 16:09 GMT-07:00 Michael Armbrust :
>>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.1.1. The vote is open until Sun, April 2nd, 2017 at 16:30
>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v2.1.1-rc2
>>>> <https://github.com/apache/spark/tree/v2.1.1-rc2> (
>>>> 02b165dcc2ee5245d1293a375a31660c9d4e1fa6)
>>>>
>>>> List of JIRA tickets resolved can be found with this filter
>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>>> .
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>

Re: welcoming Xiao Li as a committer

2016-10-05 Thread DB Tsai
Congrats, Xiao!

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x9DCC1DBD7FC7BBB2


On Wed, Oct 5, 2016 at 2:36 PM, Fred Reiss  wrote:
> Congratulations, Xiao!
>
> Fred
>
>
> On Tuesday, October 4, 2016, Joseph Bradley  wrote:
>>
>> Congrats!
>>
>> On Tue, Oct 4, 2016 at 4:09 PM, Kousuke Saruta 
>> wrote:
>>>
>>> Congratulations Xiao!
>>>
>>> - Kousuke
>>>
>>> On 2016/10/05 7:44, Bryan Cutler wrote:
>>>
>>> Congrats Xiao!
>>>
>>> On Tue, Oct 4, 2016 at 11:14 AM, Holden Karau 
>>> wrote:
>>>>
>>>> Congratulations :D :) Yay!
>>>>
>>>> On Tue, Oct 4, 2016 at 11:14 AM, Suresh Thalamati
>>>>  wrote:
>>>>>
>>>>> Congratulations, Xiao!
>>>>>
>>>>>
>>>>>
>>>>> > On Oct 3, 2016, at 10:46 PM, Reynold Xin  wrote:
>>>>> >
>>>>> > Hi all,
>>>>> >
>>>>> > Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
>>>>> > committer. Xiao has been a super active contributor to Spark SQL. 
>>>>> > Congrats
>>>>> > and welcome, Xiao!
>>>>> >
>>>>> > - Reynold
>>>>> >
>>>>>
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271
>>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-06 Thread DB Tsai
+1 for renaming the jar file.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Apr 5, 2016 at 8:02 PM, Chris Fregly  wrote:
> perhaps renaming to Spark ML would actually clear up code and documentation
> confusion?
>
> +1 for rename
>
> On Apr 5, 2016, at 7:00 PM, Reynold Xin  wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley 
> wrote:
>>
>> +1  By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>>
>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
>> wrote:
>>>
>>> This sounds good to me as well. The one thing we should pay attention to
>>> is how we update the docs so that people know to start with the spark.ml
>>> classes. Right now the docs list spark.mllib first and also seem more
>>> comprehensive in that area than in spark.ml, so maybe people naturally move
>>> towards that.
>>>
>>> Matei
>>>
>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>>>
>>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>>> need to port over in order to reach feature parity. -Xiangrui
>>>
>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman
>>>  wrote:
>>>>
>>>> Overall this sounds good to me. One question I have is that in
>>>> addition to the ML algorithms we have a number of linear algebra
>>>> (various distributed matrices) and statistical methods in the
>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>> namespace in the 2.x series ?
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>> > certainly better than two.
>>>> >
>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng 
>>>> > wrote:
>>>> >> Hi all,
>>>> >>
>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>>> >> built
>>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>>> >> API has
>>>> >> been developed under the spark.ml package, while the old RDD-based
>>>> >> API has
>>>> >> been developed in parallel under the spark.mllib package. While it
>>>> >> was
>>>> >> easier to implement and experiment with new APIs under a new package,
>>>> >> it
>>>> >> became harder and harder to maintain as both packages grew bigger and
>>>> >> bigger. And new users are often confused by having two sets of APIs
>>>> >> with
>>>> >> overlapped functions.
>>>> >>
>>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>>> >> API in
>>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>>> >> development
>>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>>> >> counting
>>>> >> the lines of Scala code, from 1.5 to the current master we added
>>>> >> ~1
>>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>>> >> to
>>>> >> gather more resources on the development of the DataFrame-based API
>>>> >> and to
>>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>>> >> MLlib
>>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>>> >>
>>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>>> >> unless
>>>> >> they block implementing new features in the DataFrame-based spark.ml
>>>> >> package.
>>>> >> * We still accept bug fixes in the RDD-based API.
>>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>>> >> series to
>>>> >> reach feature parity with the RDD-based API.
>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>>> >> deprecate
>>>> >> the RDD-based API.
>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>>> >> 3.0.
>>>> >>
>>>> >> Though the RDD-based API is already in de facto maintenance mode,
>>>> >> this
>>>> >> announcement will make it clear and hence important to both MLlib
>>>> >> developers
>>>> >> and users. So we’d greatly appreciate your feedback!
>>>> >>
>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>>> >> DataFrame-based API or even the entire MLlib component. This also
>>>> >> causes
>>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>>> >> are no
>>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>>> >>
>>>> >> Best,
>>>> >> Xiangrui
>>>> >
>>>> > -
>>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>>> >
>>>
>>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Export BLAS module on Spark MLlib

2015-11-30 Thread DB Tsai
I used reflection initially, but I found it's very slow, especially in
a tight loop. Maybe caching the reflection can help, but I never tried that.
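
For illustration, a rough sketch of doing the reflective lookup once and
caching it outside the hot loop, in the spirit of the BLASUtils link quoted
below; the private class and method names here are assumptions, so treat it
as illustrative only.

import org.apache.spark.mllib.linalg.Vector

object CachedAxpy {
  // Resolve the package-private BLAS object and its axpy method a single time.
  private val blasClazz  = Class.forName("org.apache.spark.mllib.linalg.BLAS$")
  private val blasModule = blasClazz.getField("MODULE$").get(null)
  private val axpyMethod = blasClazz.getMethod(
    "axpy", java.lang.Double.TYPE, classOf[Vector], classOf[Vector])

  // y := a * x + y, reusing the cached Method instead of reflecting per call.
  def axpy(a: Double, x: Vector, y: Vector): Unit =
    axpyMethod.invoke(blasModule, java.lang.Double.valueOf(a), x, y)
}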

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Nov 30, 2015 at 2:15 PM, Burak Yavuz  wrote:
> Or you could also use reflection like in this Spark Package:
> https://github.com/brkyvz/lazy-linalg/blob/master/src/main/scala/com/brkyvz/spark/linalg/BLASUtils.scala
>
> Best,
> Burak
>
> On Mon, Nov 30, 2015 at 12:48 PM, DB Tsai  wrote:
>>
>> The workaround is to have your code in the same package, or to write a
>> utility wrapper in the same package so you can use them in your code.
>> Mostly we implemented those BLAS routines for our own needs, and we didn't
>> have general use cases in mind. As a result, if we open them up prematurely,
>> it will add to our API maintenance cost. Once they mature, and
>> people are asking for them, we will gradually make them public.
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Sat, Nov 28, 2015 at 5:20 AM, Sasaki Kai  wrote:
>> > Hello
>> >
>> > I'm developing a Spark package that manipulates Vector and Matrix for
>> > machine learning.
>> > This package uses mllib.linalg.Vector and mllib.linalg.Matrix in order
>> > to
>> > achieve compatible interface to mllib itself. But mllib.linalg.BLAS
>> > module
>> > is private inside spark package. We cannot use BLAS from spark package.
>> > Due to this, there is no way to manipulate mllib.linalg.{Vector, Matrix}
>> > from spark package side.
>> >
>> > Is there any reason why BLAS module is not set public?
>> > If we cannot use BLAS, what is the reasonable option to manipulate
>> > Vector
>> > and Matrix from spark package?
>> >
>> > Regards
>> > Kai Sasaki(@Lewuathe)
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Export BLAS module on Spark MLlib

2015-11-30 Thread DB Tsai
The workaround is to have your code in the same package, or to write a
utility wrapper in the same package so you can use them in your code.
Mostly we implemented those BLAS routines for our own needs, and we didn't
have general use cases in mind. As a result, if we open them up prematurely,
it will add to our API maintenance cost. Once they mature, and
people are asking for them, we will gradually make them public.
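
For illustration, a minimal sketch of the "same package" workaround (the
forwarded method names are assumed from the package-private MLlib BLAS, so
double-check the signatures against your Spark version):

package org.apache.spark.mllib.linalg

object BLASProxy {
  // y := a * x + y, forwarded from the package-private BLAS object.
  def axpy(a: Double, x: Vector, y: Vector): Unit = BLAS.axpy(a, x, y)

  // Dot product, forwarded as well.
  def dot(x: Vector, y: Vector): Double = BLAS.dot(x, y)
}

User code outside Spark can then call
org.apache.spark.mllib.linalg.BLASProxy directly.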

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sat, Nov 28, 2015 at 5:20 AM, Sasaki Kai  wrote:
> Hello
>
> I'm developing a Spark package that manipulates Vector and Matrix for
> machine learning.
> This package uses mllib.linalg.Vector and mllib.linalg.Matrix in order to
> achieve a compatible interface with mllib itself. But the mllib.linalg.BLAS
> module is private to the spark package, so we cannot use BLAS from a Spark
> package. Due to this, there is no way to manipulate mllib.linalg.{Vector,
> Matrix} from the Spark package side.
>
> Is there any reason why the BLAS module is not public?
> If we cannot use BLAS, what is the reasonable option for manipulating Vector
> and Matrix from a Spark package?
>
> Regards
> Kai Sasaki(@Lewuathe)
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Ability to offer initial coefficients in ml.LogisticRegression

2015-11-02 Thread DB Tsai
Hi YiZhi,

Sure. I think Holden already created a JIRA for this. Please
coordinate with Holden, and keep me in the loop. Thanks.
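
For reference, a minimal sketch of the existing RDD-based entry point that
already accepts initial coefficients (the training data and yesterday's
weights are placeholders):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def warmStart(training: RDD[LabeledPoint], yesterdaysWeights: Vector) =
  LogisticRegressionWithSGD.train(
    training,
    100,   // numIterations
    1.0,   // stepSize
    1.0,   // miniBatchFraction
    yesterdaysWeights)   // warm start from yesterday's model

The ask in this thread is to expose the same kind of warm start in
ml.classification.LogisticRegression.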

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Nov 2, 2015 at 7:32 AM, YiZhi Liu  wrote:
> Hi Tsai,
>
> Is it proper if I create a jira and try to work on it?
>
> 2015-10-23 10:40 GMT+08:00 YiZhi Liu :
>> Thank you Tsai.
>>
>> Holden, would you mind posting the JIRA issue id here? I searched but
>> found nothing. Thanks.
>>
>> 2015-10-23 1:36 GMT+08:00 DB Tsai :
>>> There is a JIRA for this. I know Holden is interested in this.
>>>
>>>
>>> On Thursday, October 22, 2015, YiZhi Liu  wrote:
>>>>
>>>> Would someone mind giving some hint?
>>>>
>>>> 2015-10-20 15:34 GMT+08:00 YiZhi Liu :
>>>> > Hi all,
>>>> >
>>>> > I noticed that in ml.classification.LogisticRegression, users are not
>>>> > allowed to set initial coefficients, while it is supported in
>>>> > mllib.classification.LogisticRegressionWithSGD.
>>>> >
>>>> > Sometimes we know specific coefficients are close to the final optima.
>>>> > e.g., we usually pick yesterday's output model as init coefficients
>>>> > since the data distribution between two days' training sample
>>>> > shouldn't change much.
>>>> >
>>>> > Is there any concern for not supporting this feature?
>>>> >
>>>> > --
>>>> > Yizhi Liu
>>>> > Senior Software Engineer / Data Mining
>>>> > www.mvad.com, Shanghai, China
>>>>
>>>>
>>>>
>>>> --
>>>> Yizhi Liu
>>>> Senior Software Engineer / Data Mining
>>>> www.mvad.com, Shanghai, China
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>
>>>
>>>
>>> --
>>> - DB
>>>
>>> Sent from my iPhone
>>
>>
>>
>> --
>> Yizhi Liu
>> Senior Software Engineer / Data Mining
>> www.mvad.com, Shanghai, China
>
>
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Spark MLlib] about linear regression issue

2015-11-01 Thread DB Tsai
For constraints like all weights >= 0, people use LBFGS-B, which is
supported in our optimization library, Breeze.
https://github.com/scalanlp/breeze/issues/323

However, Spark's LiR implementation doesn't support such constraints.
I do see this being useful given we're experimenting with
SLIM: Sparse Linear Methods for recommendation,
http://www-users.cs.umn.edu/~xning/papers/Ning2011c.pdf which requires
all the weights to be positive (Eq. 3) to represent positive relations
between items.

In summary, it's possible and not difficult to add this constraint to
our current linear regression, but currently there is no open-source
implementation in Spark.
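
For illustration, a minimal sketch of a box-constrained (non-negative) least
squares solve with Breeze; the breeze.optimize.LBFGSB constructor taking
(lowerBounds, upperBounds) is an assumption to verify against your Breeze
version, and the toy matrix and vector are made up.

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.{DiffFunction, LBFGSB}

object NonNegativeLeastSquares {
  def main(args: Array[String]): Unit = {
    // Objective: 0.5 * ||Ax - b||^2 with gradient A^T (Ax - b).
    val A = DenseMatrix((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
    val b = DenseVector(1.0, 2.0, 3.0)

    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val r = A * x - b
        (0.5 * (r dot r), A.t * r)
      }
    }

    // Box constraints 0 <= x_i enforce the non-negative weights; a large
    // finite upper bound also works if infinities are not accepted.
    val lower  = DenseVector.zeros[Double](2)
    val upper  = DenseVector.fill(2)(Double.PositiveInfinity)
    val solver = new LBFGSB(lower, upper)

    println(solver.minimize(f, DenseVector.zeros[Double](2)))
  }
}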

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Nov 1, 2015 at 9:22 AM, Zhiliang Zhu  wrote:
> Dear All,
>
> As for N-dimensional linear regression, when the number of labeled training
> points (or the rank of the labeled point space) is less than N, then from a
> math perspective the weights of the trained linear model may not be unique.
>
> However, the output of model.weight() from Spark may have some wi < 0. My
> issue is: is there some proper way to get only
> a specific output weight vector with all wi >= 0 ...
>
> Yes, the above is the same as the issue of solving a linear system of
> equations, Aw = b, where r(A, b) = r(A) < columnNo(A); then w has
> infinitely many solutions, but here we only need one solution with all wi >= 0.
> When there is a unique solution, both LR and SVD work perfectly.
>
> I will appreciate your all kind help very much~~
> Best Regards,
> Zhiliang
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark Implementation of XGBoost

2015-10-27 Thread DB Tsai
Hi Meihua,

For categorical features, the ordinal issue can be solved by trying all
2^(q-1) - 1 different partitions of the q values into two groups. However,
it's computationally expensive. In Hastie's book, in section 9.2.4, the
trees can be trained by sorting the categories by their residuals and
learning the splits as if the feature were ordered. It can be proven that
this gives the optimal solution. I have a proof that this works for
learning regression trees through variance reduction.
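
For illustration, a tiny sketch of the sorted-residual trick (the data shapes
are made up and this is not from the SparkXGBoost code): sort the q category
levels by their mean residual, then only the q - 1 prefix splits need to be
evaluated instead of all 2^(q-1) - 1 two-way partitions.

case class Point(category: Int, residual: Double)

def candidateSplits(points: Seq[Point]): Seq[Set[Int]] = {
  // Order the levels by their mean residual.
  val orderedLevels = points
    .groupBy(_.category)
    .map { case (level, ps) => level -> ps.map(_.residual).sum / ps.size }
    .toSeq
    .sortBy(_._2)
    .map(_._1)
  // Each prefix of the ordered levels is one candidate "left" group.
  (1 until orderedLevels.size).map(i => orderedLevels.take(i).toSet)
}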

I'm also interested in understanding how the L1 and L2 regularization
within the boosting works (and if it helps with overfitting more than
shrinkage).

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu  wrote:
> Hi DB Tsai,
>
> Thank you very much for your interest and comment.
>
> 1) feature sub-sample is per-node, like random forest.
>
> 2) The current code heavily exploits the tree structure to speed up
> the learning (such as processing multiple learning node in one pass of
> the training data). So a generic GBM is likely to be a different
> codebase. Do you have any nice reference of efficient GBM? I am more
> than happy to look into that.
>
> 3) The algorithm accept training data as a DataFrame with the
> featureCol indexed by VectorIndexer. You can specify which variable is
> categorical in the VectorIndexer. Please note that currently all
> categorical variables are treated as ordered. If you want some
> categorical variables as unordered, you can pass the data through
> OneHotEncoder before the VectorIndexer. I do have a plan to handle
> unordered categorical variable using the approach in RF in Spark ML
> (Please see roadmap in the README.md)
>
> Thanks,
>
> Meihua
>
>
>
> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>> you think you can implement generic GBM and have it merged as part of
>> Spark codebase?
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>>  wrote:
>>> Hi Spark User/Dev,
>>>
>>> Inspired by the success of XGBoost, I have created a Spark package for
>>> gradient boosting tree with 2nd order approximation of arbitrary
>>> user-defined loss functions.
>>>
>>> https://github.com/rotationsymmetry/SparkXGBoost
>>>
>>> Currently linear (normal) regression, binary classification, Poisson
>>> regression are supported. You can extend with other loss function as
>>> well.
>>>
>>> L1, L2, bagging, feature sub-sampling are also employed to avoid 
>>> overfitting.
>>>
>>> Thank you for testing. I am looking forward to your comments and
>>> suggestions. Bugs or improvements can be reported through GitHub.
>>>
>>> Many thanks!
>>>
>>> Meihua
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Also, does it support categorical feature?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> you think you can implement generic GBM and have it merged as part of
> Spark codebase?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>  wrote:
>> Hi Spark User/Dev,
>>
>> Inspired by the success of XGBoost, I have created a Spark package for
>> gradient boosting tree with 2nd order approximation of arbitrary
>> user-defined loss functions.
>>
>> https://github.com/rotationsymmetry/SparkXGBoost
>>
>> Currently linear (normal) regression, binary classification, Poisson
>> regression are supported. You can extend with other loss function as
>> well.
>>
>> L1, L2, bagging, feature sub-sampling are also employed to avoid overfitting.
>>
>> Thank you for testing. I am looking forward to your comments and
>> suggestions. Bugs or improvements can be reported through GitHub.
>>
>> Many thanks!
>>
>> Meihua
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Interesting. For feature sub-sampling, is it per-node or per-tree? Do
you think you can implement generic GBM and have it merged as part of
Spark codebase?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
 wrote:
> Hi Spark User/Dev,
>
> Inspired by the success of XGBoost, I have created a Spark package for
> gradient boosting tree with 2nd order approximation of arbitrary
> user-defined loss functions.
>
> https://github.com/rotationsymmetry/SparkXGBoost
>
> Currently linear (normal) regression, binary classification, Poisson
> regression are supported. You can extend with other loss function as
> well.
>
> L1, L2, bagging, feature sub-sampling are also employed to avoid overfitting.
>
> Thank you for testing. I am looking forward to your comments and
> suggestions. Bugs or improvements can be reported through GitHub.
>
> Many thanks!
>
> Meihua
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Ability to offer initial coefficients in ml.LogisticRegression

2015-10-22 Thread DB Tsai
There is a JIRA for this. I know Holden is interested in this.

On Thursday, October 22, 2015, YiZhi Liu  wrote:

> Would someone mind giving some hint?
>
> 2015-10-20 15:34 GMT+08:00 YiZhi Liu >:
> > Hi all,
> >
> > I noticed that in ml.classification.LogisticRegression, users are not
> > allowed to set initial coefficients, while it is supported in
> > mllib.classification.LogisticRegressionWithSGD.
> >
> > Sometimes we know specific coefficients are close to the final optima.
> > e.g., we usually pick yesterday's output model as init coefficients
> > since the data distribution between two days' training sample
> > shouldn't change much.
> >
> > Is there any concern for not supporting this feature?
> >
> > --
> > Yizhi Liu
> > Senior Software Engineer / Data Mining
> > www.mvad.com, Shanghai, China
>
>
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>

-- 
- DB

Sent from my iPhone


Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-12 Thread DB Tsai
Hi Liu,

In ML, even after extracting the data into an RDD, the versions between MLlib
and ML are quite different. Due to legacy design, in MLlib we use Updater
for handling regularization, and this layer of abstraction also does the
adaptive step size, which is only for SGD. In order to get it working with
LBFGS, some hacks were done here and there, and in Updater all the
components, including the intercept, are regularized, which is not desirable in
many cases. Also, in the legacy design, it's hard for us to do in-place
standardization to improve the convergence rate. As a result, at some
point we decided to ditch those abstractions and customize them for each
algorithm. (Even LiR and LoR use different tricks to get better
performance for numerical optimization, so it was hard to share code at that
time. But I can see the point that we have working code now, so it's time
to try to refactor that code to share more.)


Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Mon, Oct 12, 2015 at 1:24 AM, YiZhi Liu  wrote:

> Hi Joseph,
>
> Thank you for clarifying the motivation that you setup a different API
> for ml pipelines, it sounds great. But I still think we could extract
> some common parts of the training & inference procedures for ml and
> mllib. In ml.classification.LogisticRegression, you simply transform
> the DataFrame into RDD and follow the same procedures in
> mllib.optimization.{LBFGS,OWLQN}, right?
>
> My suggestion is, if I may, ml package should focus on the public API,
> and leave the underlying implementations, e.g. numerical optimization,
> to mllib package.
>
> Please let me know if my understanding has any problem. Thank you!
>
> 2015-10-08 1:15 GMT+08:00 Joseph Bradley :
> > Hi YiZhi Liu,
> >
> > The spark.ml classes are part of the higher-level "Pipelines" API, which
> > works with DataFrames.  When creating this API, we decided to separate it
> > from the old API to avoid confusion.  You can read more about it here:
> > http://spark.apache.org/docs/latest/ml-guide.html
> >
> > For (3): We use Breeze, but we have to modify it in order to do
> distributed
> > optimization based on Spark.
> >
> > Joseph
> >
> > On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu  wrote:
> >>
> >> Hi everyone,
> >>
> >> I'm curious about the difference between
> >> ml.classification.LogisticRegression and
> >> mllib.classification.LogisticRegressionWithLBFGS. Both of them are
> >> optimized using LBFGS, the only difference I see is LogisticRegression
> >> takes DataFrame while LogisticRegressionWithLBFGS takes RDD.
> >>
> >> So I wonder,
> >> 1. Why not simply add a DataFrame training interface to
> >> LogisticRegressionWithLBFGS?
> >> 2. Whats the difference between ml.classification and
> >> mllib.classification package?
> >> 3. Why doesn't ml.classification.LogisticRegression call
> >> mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
> >> it uses breeze.optimize.LBFGS and re-implements most of the procedures
> >> in mllib.optimization.{LBFGS,OWLQN}.
> >>
> >> Thank you.
> >>
> >> Best,
> >>
> >> --
> >> Yizhi Liu
> >> Senior Software Engineer / Data Mining
> >> www.mvad.com, Shanghai, China
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
>
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: MLlib: Anybody working on hierarchical topic models like HLDA?

2015-06-03 Thread DB Tsai
Is your HDP implementation based on distributed Gibbs sampling? Thanks.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Wed, Jun 3, 2015 at 8:13 PM, Yang, Yuhao  wrote:
> Hi Lorenz,
>
>
>
>   I’m trying to build a prototype of HDP for a customer based on the current
> LDA implementations. An initial version will probably be ready within the
> next one or two weeks. I’ll share it and hopefully we can join forces.
>
>
>
>   One concern is that I’m not sure how widely it will be used in the
> industry or community. Hope it’s popular enough to be accepted by Spark
> MLlib.
>
>
>
> http://www.cs.berkeley.edu/~jordan/papers/hierarchical-dp.pdf
>
> http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf
>
>
>
> Regards,
>
> Yuhao
>
>
>
> From: Joseph Bradley [mailto:jos...@databricks.com]
> Sent: Thursday, June 4, 2015 7:17 AM
> To: Lorenz Fischer
> Cc: dev@spark.apache.org
> Subject: Re: MLlib: Anybody working on hierarchical topic models like HLDA?
>
>
>
> Hi Lorenz,
>
>
>
> I'm not aware of people working on hierarchical topic models for MLlib, but
> that would be cool to see.  Hopefully other devs know more!
>
>
>
> Glad that the current LDA is helpful!
>
>
>
> Joseph
>
>
>
> On Wed, Jun 3, 2015 at 6:43 AM, Lorenz Fischer 
> wrote:
>
> Hi All
>
>
>
> I'm working on a project in which I use the current LDA implementation that
> has been contributed by Databricks' Joseph Bradley et al. for the recent
> 1.3.0 release (thanks guys!). While this is great, my project requires
> several levels of topics, as I would like to offer users to drill down into
> subtopics.
>
>
>
> As I understand it, Hierarchical Latent Dirichlet Allocation (HLDA) would
> offer such a hierarchy. Looking at the papers and talks by Blei [1,2] and
> Jordan [3], I think I should be able to implement HLDA in Spark using the
> Nested Chinese Restaurant Process (NCRP). However, as I have some time
> constraints, I'm not sure if I will have the time to do it 'the proper way'.
>
>
>
> In any case, I wanted to quickly ask around if anybody is already working on
> this or on some other form of a hierarchical topic model. Maybe I could
> contribute to these efforts instead of starting from scratch.
>
>
>
> Best,
>
> Lorenz
>
>
>
> [1] http://www.cs.princeton.edu/~blei/papers/BleiGriffithsJordan2009.pdf
>
> [2]
> http://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf
>
> [3] https://www.youtube.com/watch?v=PxgW3lOrj60
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: spark packages

2015-05-23 Thread DB Tsai
I thought LGPL is okay but GPL is not okay for Apache project.

On Saturday, May 23, 2015, Patrick Wendell  wrote:

> Yes - spark packages can include non ASF licenses.
>
> On Sat, May 23, 2015 at 6:16 PM, Debasish Das  > wrote:
> > Hi,
> >
> > Is it possible to add GPL/LGPL code on spark packages or it must be
> licensed
> > under Apache as well ?
> >
> > I want to expose Professor Tim Davis's LGPL library for sparse algebra
> and
> > ECOS GPL library through the package.
> >
> > Thanks.
> > Deb
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>

-- 
Sent from my iPhone


Re: Regularization in MLlib

2015-04-14 Thread DB Tsai
Hi Theodore,

I'm currently working on elastic-net regression in the ML framework, and I
decided not to have any extra layer of abstraction for now but to focus
on accuracy and performance. We may come up with a proper solution
later. Any ideas are welcome.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Tue, Apr 14, 2015 at 6:54 AM, Theodore Vasiloudis
 wrote:
> Hello DB,
>
> could you elaborate a bit on how you are currently fixing this for the new
> ML pipeline framework?
> Are there any JIRAs/PR we could follow?
>
> Regards,
> Theodore
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Regularization-in-MLlib-tp11457p11583.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Regularization in MLlib

2015-04-07 Thread DB Tsai
1) Norm(weights, N) will return (w_1^N + w_2^N + ... + w_n^N)^(1/N), so norm
* norm is required for the L2 penalty (see the short sketch after these
three points).

2) This is a bug, as you said. I intended to fix this using weighted
regularization, where the intercept term is regularized with weight
zero: https://github.com/apache/spark/pull/1518 But I never actually
had time to finish it. In the meantime, I'm fixing this without that
framework in the new ML pipeline framework.

3) I think in the long term we need a weighted regularizer instead of the
updater, which couples regularization with the adaptive step-size update for
GD that is not needed in other optimization packages.
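
A short sketch of point 1 (toy numbers): norm(w, 2.0) is already the square
root of the sum of squares, so the usual (regParam / 2) * ||w||_2^2 penalty
needs the square of it.

import breeze.linalg.{norm, DenseVector}

val w        = DenseVector(0.5, -1.5, 2.0)
val regParam = 0.1
val l2Loss   = 0.5 * regParam * math.pow(norm(w, 2.0), 2)
// equivalently: 0.5 * regParam * (w dot w)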

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Tue, Apr 7, 2015 at 3:03 PM, Ulanov, Alexander
 wrote:
> Hi,
>
> Could anyone elaborate on the regularization in Spark? I've found that L1 and 
> L2 are implemented with Updaters (L1Updater, SquaredL2Updater).
> 1) Why is the loss reported by L2 (0.5 * regParam * norm * norm), where norm is
> Norm(weights, 2.0)? It should be 0.5 * regParam * norm (the 0.5 disappears after
> differentiation). It seems that it is mixed up with mean squared error.
> 2) Why are all weights regularized? I think we should leave the bias weights
> (aka free or intercept) untouched if we don't assume that the data is
> centered.
> 3) Are there any short-term plans to move regularization from the updater to a
> more convenient place?
>
> Best regards, Alexander
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: LogisticGradient Design

2015-03-25 Thread DB Tsai
I did a benchmark when I used the if-else statement to switch between the
binary & multinomial logistic loss and gradient, and there is no
performance hit at all. However, I'm refactoring the LogisticGradient
code so that addBias and scaling can be done in LogisticGradient
instead of on the input dataset, to avoid the second cache. In this case,
the code will be more complicated, so I will split the code into two
paths. This will be done in another PR.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Wed, Mar 25, 2015 at 11:57 AM, Joseph Bradley  wrote:
> It would be nice to see how big a performance hit we take from combining
> binary & multiclass logistic loss/gradient.  If it's not a big hit, then it
> might be simpler from an outside API perspective to keep them in 1 class
> (even if it's more complicated within).
> Joseph
>
> On Wed, Mar 25, 2015 at 8:15 AM, Debasish Das 
> wrote:
>
>> Hi,
>>
>> Right now LogisticGradient implements both binary and multi-class in the
>> same class using an if-else statement which is a bit convoluted.
>>
>> For Generalized matrix factorization, if the data has distinct ratings I
>> want to use LeastSquareGradient (regression has given best results to date)
>> but if the data has binary labels 0/1 based on domain knowledge (implicit
>> for example, visits no-visits) I want to use a LogisticGradient without any
>> overhead for multi-class if-else...
>>
>> I can compare the performance of LeastSquareGradient and multi-class
>> LogisticGradient on the recommendation metrics, but it will be great if we
>> can separate binary and multi-class into separate classes.
>> MultiClassLogistic can extend BinaryLogistic, but mixing them in
>> the same class is an overhead for users (like me) who want to use
>> BinaryLogistic for their application.
>>
>> Thanks.
>> Deb
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-15 Thread DB Tsai
It's a bug on Breeze's side. Once David fixes it and publishes it to
Maven, we can upgrade to Breeze 0.11.2. Please file a JIRA ticket for
this issue. Thanks.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Sun, Mar 15, 2015 at 12:45 AM, Yu Ishikawa
 wrote:
> Hi all,
>
> Are there any bugs in dividing a Breeze sparse vector at Spark v1.3.0-rc3? When
> I tried to divide a sparse vector at Spark v1.3.0-rc3, I got a wrong result
> if the target vector has any zero values.
>
> Spark v1.3.0-rc3 depends on Breeze v0.11.1, and Breeze v0.11.1 seems to have
> a bug when dividing a sparse vector by a scalar value. When dividing a Breeze
> sparse vector which has any zero values, the result seems to be a zero
> vector. However, we can run the same code on Spark v1.2.x.
>
> However, there is no problem multiplying a Breeze sparse vector. I asked the
> Breeze community about this problem in the issue below.
> https://github.com/scalanlp/breeze/issues/382
>
> For example,
> ```
> test("dividing a breeze spark vector") {
> val vec = Vectors.sparse(6, Array(0, 4), Array(0.0, 10.0)).toBreeze
> val n = 60.0
> val answer1 = vec :/ n
> val answer2 = vec.toDenseVector :/ n
> println(vec)
> println(answer1)
> println(answer2)
> assert(answer1.toDenseVector === answer2)
> }
>
> SparseVector((0,0.0), (4,10.0))
> SparseVector()
> DenseVector(0.0, 0.0, 0.0, 0.0, 0.1, 0.0)
>
> DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) did not equal DenseVector(0.0,
> 0.0, 0.0, 0.0, 0.1, 0.0)
> org.scalatest.exceptions.TestFailedException: DenseVector(0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0) did not equal DenseVector(0.0, 0.0, 0.0, 0.0,
> 0.1, 0.0)
> ```
>
> Thanks,
> Yu Ishikawa
>
>
>
> -
> -- Yu Ishikawa
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: LinearRegressionWithSGD accuracy

2015-01-28 Thread DB Tsai
Hi Robin,

You can try this PR out. It has built-in feature scaling and
ElasticNet regularization (an L1/L2 mix). This implementation can stably
converge to the model from R's glmnet package.

https://github.com/apache/spark/pull/4259

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai



On Thu, Jan 15, 2015 at 9:42 AM, Robin East  wrote:
> -dev, +user
>
> You’ll need to set the gradient descent step size to something small - a bit 
> of trial and error shows that 0.0001 works.
>
> You’ll need to create a LinearRegressionWithSGD instance and set the step 
> size explicitly:
>
> val lr = new LinearRegressionWithSGD()
> lr.optimizer.setStepSize(0.0001)
> lr.optimizer.setNumIterations(100)
> val model = lr.run(parsedData)
>
> On 15 Jan 2015, at 16:46, devl.development  wrote:
>
>> From what I gather, you use LinearRegressionWithSGD to predict y or the
>> response variable given a feature vector x.
>>
>> In a simple example I used a perfectly linear dataset such that x=y
>> y,x
>> 1,1
>> 2,2
>> ...
>>
>> 1,1
>>
>> Using the out-of-box example from the website (with and without scaling):
>>
>> val data = sc.textFile(file)
>>
>>val parsedData = data.map { line =>
>>  val parts = line.split(',')
>> LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) //y
>> and x
>>
>>}
>>val scaler = new StandardScaler(withMean = true, withStd = true)
>>  .fit(parsedData.map(x => x.features))
>>val scaledData = parsedData
>>  .map(x =>
>>  LabeledPoint(x.label,
>>scaler.transform(Vectors.dense(x.features.toArray))))
>>
>>// Building the model
>>val numIterations = 100
>>val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>>
>>// Evaluate model on training examples and compute training error *
>> tried using both scaledData and parsedData
>>val valuesAndPreds = scaledData.map { point =>
>>  val prediction = model.predict(point.features)
>>  (point.label, prediction)
>>}
>>val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
>>println("training Mean Squared Error = " + MSE)
>>
>> Both scaled and unscaled attempts give:
>>
>> training Mean Squared Error = NaN
>>
>> I've even tried x, y+(sample noise from normal with mean 0 and stddev 1)
>> still comes up with the same thing.
>>
>> Is this not supposed to work for x and y or 2 dimensional plots? Is there
>> something I'm missing or wrong in the code above? Or is there a limitation
>> in the method?
>>
>> Thanks for any advice.
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
>> Sent from the Apache Spark Developers List mailing list archive at 
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Maximum size of vector that reduce can handle

2015-01-23 Thread DB Tsai
Hi Alexander,

For `reduce`, it's an action that will collect all the data from the
mappers to the driver and perform the aggregation in the driver. As a
result, if the output from each mapper is very large and the number of
partitions is large, it might cause a problem.

For `treeReduce`, as the name indicates, the way it works is that in the
first layer it aggregates the output of the mappers two by two, resulting
in half the number of outputs. Then we continue the aggregation layer by
layer. The final aggregation is still done in the driver, but by that time
the amount of data is small.

By default, depth 2 is used, so if you have very many partitions of large
vectors, this may still cause an issue. You can increase the depth to a
higher number so that in the final reduce on the driver, the number of
partial results is very small.
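
For illustration, a small sketch of the depth knob, reusing the shapes from
the original script (the sizes here are placeholders): with a larger depth,
more of the aggregation happens on the executors and the driver only merges
a handful of partial sums.

import breeze.linalg.DenseVector
import org.apache.spark.mllib.rdd.RDDFunctions._   // treeReduce on RDDs in this era

val n  = 1000000   // vector length, placeholder
val p  = 120       // number of partitions, placeholder
val vv = sc.parallelize(0 until p, p).map(_ => DenseVector.rand[Double](n))

// depth = 4 adds intermediate aggregation layers compared to the default of 2.
val total = vv.treeReduce(_ + _, 4)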

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai



On Fri, Jan 23, 2015 at 12:07 PM, Ulanov, Alexander
 wrote:
> Hi DB Tsai,
>
> Thank you for your suggestion. Actually, I've started my experiments with 
> "treeReduce". Originally, I had "vv.treeReduce(_ + _, 2)" in my script 
> exactly because MLlib optimizers are using it, as you pointed out with LBFGS. 
> However, it leads to the same problems as "reduce", but presumably not so 
> directly. As far as I understand, treeReduce limits the number of 
> communications between workers and master forcing workers to partially 
> compute the reduce operation.
>
> Are you sure that driver will first collect all results (or all partial 
> results in treeReduce) and ONLY then perform aggregation? If that is the 
> problem, then how to force it to do aggregation after receiving each portion 
> of data from Workers?
>
> Best regards, Alexander
>
> -Original Message-
> From: DB Tsai [mailto:dbt...@dbtsai.com]
> Sent: Friday, January 23, 2015 11:53 AM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Maximum size of vector that reduce can handle
>
> Hi Alexander,
>
> When you use `reduce` to aggregate the vectors, those will actually be pulled
> into the driver and merged over there. Obviously, that's not scalable given you
> are doing deep neural networks, which have so many coefficients.
>
> Please try treeReduce instead, which is what we do in linear regression and
> logistic regression.
>
> See 
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala
> for example.
>
> val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
>   seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>     val l = localGradient.compute(features, label, bcW.value, grad)
>     (grad, loss + l)
>   },
>   combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>     axpy(1.0, grad2, grad1)
>     (grad1, loss1 + loss2)
>   })
>
> Sincerely,
>
> DB Tsai
> ---
> Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
>
> On Fri, Jan 23, 2015 at 10:00 AM, Ulanov, Alexander  
> wrote:
>> Dear Spark developers,
>>
>> I am trying to measure the Spark reduce performance for big vectors. My 
>> motivation is related to machine learning gradient. Gradient is a vector 
>> that is computed on each worker and then all results need to be summed up 
>> and broadcasted back to workers. For example, present machine learning 
>> applications involve very long parameter vectors, for deep neural networks 
>> it can be up to 2Billions. So, I want to measure the time that is needed for 
>> this operation depending on the size of vector and number of workers. I 
>> wrote few lines of code that assume that Spark will distribute partitions 
>> among all available workers. I have 6-machine cluster (Xeon 3.3GHz 4 cores, 
>> 16GB RAM), each runs 2 Workers.
>>
>> import org.apache.spark.mllib.rdd.RDDFunctions._
>> import breeze.linalg._
>> import org.apache.log4j._
>> Logger.getRootLogger.setLevel(Level.OFF)
>> val n = 6000
>> val p = 12
>> val vv = sc.parallelize(0 until p, p).map(i =>
>> DenseVector.rand[Double]( n )) vv.reduce(_ + _)
>>
>> When executing in shell with 60M vector it crashes after some period of 
>> time. One of the node contains the following in stdout:
>> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
>> os::commit_memory(0x00075550, 2863661056, 0) failed;
>> error='Cannot allocate memory' (errno=12) # # There is insufficient memory 
>> for the Java Runtime

Re: Maximum size of vector that reduce can handle

2015-01-23 Thread DB Tsai
Hi Alexander,

When you use `reduce` to aggregate the vectors, those will actually be
pulled into the driver and merged over there. Obviously, that's not
scalable given you are doing deep neural networks, which have so many
coefficients.

Please try treeReduce instead, which is what we do in linear regression
and logistic regression.

See 
https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala
for example.

val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
  seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
    val l = localGradient.compute(features, label, bcW.value, grad)
    (grad, loss + l)
  },
  combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
    axpy(1.0, grad2, grad1)
    (grad1, loss1 + loss2)
  })

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai



On Fri, Jan 23, 2015 at 10:00 AM, Ulanov, Alexander
 wrote:
> Dear Spark developers,
>
> I am trying to measure the Spark reduce performance for big vectors. My 
> motivation is related to machine learning gradient. Gradient is a vector that 
> is computed on each worker and then all results need to be summed up and 
> broadcasted back to workers. For example, present machine learning 
> applications involve very long parameter vectors, for deep neural networks it 
> can be up to 2Billions. So, I want to measure the time that is needed for 
> this operation depending on the size of vector and number of workers. I wrote 
> few lines of code that assume that Spark will distribute partitions among all 
> available workers. I have 6-machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM), 
> each runs 2 Workers.
>
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 6000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
> vv.reduce(_ + _)
>
> When executing in shell with 60M vector it crashes after some period of time. 
> One of the node contains the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
> os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
> committing reserved memory.
>
> I run shell with --executor-memory 8G --driver-memory 8G, so handling 60M 
> vector of Double should not be a problem. Are there any big overheads for 
> this? What is the maximum size of vector that reduce can handle?
>
> Best regards, Alexander
>
> P.S.
>
> "spark.driver.maxResultSize 0" needs to set in order to run this code. I also 
> needed to change "java.io.tmpdir" and "spark.local.dir" folders because my 
> /tmp folder which is default, was too small and Spark swaps heavily into this 
> folder. Without these settings I get either "no space left on device" or "out 
> of memory" exceptions.
>
> I also submitted a bug https://issues.apache.org/jira/browse/SPARK-5386
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


