Re: Release Apache Spark 2.4.4

2019-08-13 Thread Terry Kim
Can the following be included?

[SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in
EpochTracker (to support Python UDFs)
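
For background on why InheritableThreadLocal (rather than a plain ThreadLocal) matters for Python UDFs, here is a minimal Python sketch; it is an illustrative analogue of the JVM behavior, not EpochTracker's actual code:

```python
# Minimal illustration of the thread-local inheritance problem the patch
# addresses (assumption: a Python analogue, not EpochTracker's real code).
# Java's ThreadLocal, like Python's threading.local, is invisible in child
# threads; InheritableThreadLocal copies the value at thread creation, which
# the explicit hand-off below simulates.
import threading

local = threading.local()
local.epoch = 42

seen = {}

def child_plain():
    # A plain thread-local set on the parent is not visible here.
    seen["plain"] = getattr(local, "epoch", None)

def child_inheritable(snapshot):
    # "Inheritable" semantics: the value is captured at thread creation.
    seen["inherited"] = snapshot

t1 = threading.Thread(target=child_plain)
t1.start(); t1.join()

t2 = threading.Thread(target=child_inheritable, args=(local.epoch,))
t2.start(); t2.join()

assert seen["plain"] is None    # plain thread-local: lost in the child
assert seen["inherited"] == 42  # inheritable semantics: value survives
```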


Thanks,
Terry

On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan  wrote:

> +1
>
> On Wed, Aug 14, 2019 at 12:52 PM Holden Karau 
> wrote:
>
>> +1
>> Does anyone have any critical fixes they’d like to see in 2.4.4?
>>
>> On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:
>>
>>> Seems fine to me if there are enough valuable fixes to justify another
>>> release. If there are any other important fixes imminent, it's fine to
>>> wait for those.
>>>
>>>
>>> On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > Spark 2.4.3 was released three months ago (8th May).
>>> > As of today (13th August), there are 112 commits (75 JIRAs) in
>>> `branch-24` since 2.4.3.
>>> >
>>> > It would be great if we can have Spark 2.4.4.
>>> > Shall we start `2.4.4 RC1` next Monday (19th August)?
>>> >
>>> > Last time, there was a request for K8s issue and now I'm waiting for
>>> SPARK-27900.
>>> > Please let me know if there is another issue.
>>> >
>>> > Thanks,
>>> > Dongjoon.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Release Apache Spark 2.4.4

2019-08-13 Thread Wenchen Fan
+1

On Wed, Aug 14, 2019 at 12:52 PM Holden Karau  wrote:

> +1
> Does anyone have any critical fixes they’d like to see in 2.4.4?
>
> On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:
>
>> Seems fine to me if there are enough valuable fixes to justify another
>> release. If there are any other important fixes imminent, it's fine to
>> wait for those.
>>
>>
>> On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Spark 2.4.3 was released three months ago (8th May).
>> > As of today (13th August), there are 112 commits (75 JIRAs) in
>> `branch-24` since 2.4.3.
>> >
>> > It would be great if we can have Spark 2.4.4.
>> > Shall we start `2.4.4 RC1` next Monday (19th August)?
>> >
>> > Last time, there was a request for K8s issue and now I'm waiting for
>> SPARK-27900.
>> > Please let me know if there is another issue.
>> >
>> > Thanks,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Release Apache Spark 2.4.4

2019-08-13 Thread Holden Karau
+1
Does anyone have any critical fixes they’d like to see in 2.4.4?

On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:

> Seems fine to me if there are enough valuable fixes to justify another
> release. If there are any other important fixes imminent, it's fine to
> wait for those.
>
>
> On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > Spark 2.4.3 was released three months ago (8th May).
> > As of today (13th August), there are 112 commits (75 JIRAs) in
> `branch-24` since 2.4.3.
> >
> > It would be great if we can have Spark 2.4.4.
> > Shall we start `2.4.4 RC1` next Monday (19th August)?
> >
> > Last time, there was a request for K8s issue and now I'm waiting for
> SPARK-27900.
> > Please let me know if there is another issue.
> >
> > Thanks,
> > Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


RE: Release Apache Spark 2.4.4

2019-08-13 Thread Kazuaki Ishizaki
Thanks, Dongjoon!
+1

Kazuaki Ishizaki,



From:   Hyukjin Kwon 
To: Takeshi Yamamuro 
Cc: Dongjoon Hyun , dev 
, User 
Date:   2019/08/14 09:21
Subject:[EXTERNAL] Re: Release Apache Spark 2.4.4



+1

On Wed, Aug 14, 2019 at 9:13 AM, Takeshi Yamamuro wrote:
Hi,

Thanks for your notification, Dongjoon!
I put some links for the other committers/PMCs to access the info easily:

A commit list on GitHub since the last release: 
https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...branch-2.4
An issue list in JIRA: 
https://issues.apache.org/jira/projects/SPARK/versions/12345466#release-report-tab-body
The 5 correctness issues resolved in branch-2.4:
https://issues.apache.org/jira/browse/SPARK-27798?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012345466%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC

Anyway, +1

Best,
Takeshi

On Wed, Aug 14, 2019 at 8:25 AM DB Tsai  wrote:
+1

On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun  
wrote:
>
> Hi, All.
>
> Spark 2.4.3 was released three months ago (8th May).
> As of today (13th August), there are 112 commits (75 JIRAs) in 
`branch-24` since 2.4.3.
>
> It would be great if we can have Spark 2.4.4.
> Shall we start `2.4.4 RC1` next Monday (19th August)?
>
> Last time, there was a request for K8s issue and now I'm waiting for 
SPARK-27900.
> Please let me know if there is another issue.
>
> Thanks,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



-- 
---
Takeshi Yamamuro




Re: Release Apache Spark 2.4.4

2019-08-13 Thread Sean Owen
Seems fine to me if there are enough valuable fixes to justify another
release. If there are any other important fixes imminent, it's fine to
wait for those.


On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Spark 2.4.3 was released three months ago (8th May).
> As of today (13th August), there are 112 commits (75 JIRAs) in `branch-24` 
> since 2.4.3.
>
> It would be great if we can have Spark 2.4.4.
> Shall we start `2.4.4 RC1` next Monday (19th August)?
>
> Last time, there was a request for K8s issue and now I'm waiting for 
> SPARK-27900.
> Please let me know if there is another issue.
>
> Thanks,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Release Apache Spark 2.4.4

2019-08-13 Thread Hyukjin Kwon
+1

On Wed, Aug 14, 2019 at 9:13 AM, Takeshi Yamamuro wrote:

> Hi,
>
> Thanks for your notification, Dongjoon!
> I put some links for the other committers/PMCs to access the info easily:
>
> A commit list in github from the last release:
> https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...branch-2.4
> A issue list in jira:
> https://issues.apache.org/jira/projects/SPARK/versions/12345466#release-report-tab-body
> The 5 correctness issues resolved in branch-2.4:
>
> https://issues.apache.org/jira/browse/SPARK-27798?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012345466%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>
> Anyway, +1
>
> Best,
> Takeshi
>
> On Wed, Aug 14, 2019 at 8:25 AM DB Tsai  wrote:
>
>> +1
>>
>> On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Spark 2.4.3 was released three months ago (8th May).
>> > As of today (13th August), there are 112 commits (75 JIRAs) in
>> `branch-24` since 2.4.3.
>> >
>> > It would be great if we can have Spark 2.4.4.
>> > Shall we start `2.4.4 RC1` next Monday (19th August)?
>> >
>> > Last time, there was a request for K8s issue and now I'm waiting for
>> SPARK-27900.
>> > Please let me know if there is another issue.
>> >
>> > Thanks,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Release Apache Spark 2.4.4

2019-08-13 Thread Takeshi Yamamuro
Hi,

Thanks for your notification, Dongjoon!
I put some links for the other committers/PMCs to access the info easily:

A commit list on GitHub since the last release:
https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...branch-2.4
An issue list in JIRA:
https://issues.apache.org/jira/projects/SPARK/versions/12345466#release-report-tab-body
The 5 correctness issues resolved in branch-2.4:
https://issues.apache.org/jira/browse/SPARK-27798?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012345466%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC

Anyway, +1

Best,
Takeshi

On Wed, Aug 14, 2019 at 8:25 AM DB Tsai  wrote:

> +1
>
> On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > Spark 2.4.3 was released three months ago (8th May).
> > As of today (13th August), there are 112 commits (75 JIRAs) in
> `branch-24` since 2.4.3.
> >
> > It would be great if we can have Spark 2.4.4.
> > Shall we start `2.4.4 RC1` next Monday (19th August)?
> >
> > Last time, there was a request for K8s issue and now I'm waiting for
> SPARK-27900.
> > Please let me know if there is another issue.
> >
> > Thanks,
> > Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Release Apache Spark 2.4.4

2019-08-13 Thread DB Tsai
+1

On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Spark 2.4.3 was released three months ago (8th May).
> As of today (13th August), there are 112 commits (75 JIRAs) in `branch-24` 
> since 2.4.3.
>
> It would be great if we can have Spark 2.4.4.
> Shall we start `2.4.4 RC1` next Monday (19th August)?
>
> Last time, there was a request for K8s issue and now I'm waiting for 
> SPARK-27900.
> Please let me know if there is another issue.
>
> Thanks,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Release Apache Spark 2.4.4

2019-08-13 Thread Dongjoon Hyun
Hi, All.

Spark 2.4.3 was released three months ago (8th May).
As of today (13th August), there are 112 commits (75 JIRAs) in `branch-2.4`
since 2.4.3.

It would be great if we could have Spark 2.4.4.
Shall we start `2.4.4 RC1` next Monday (19th August)?

Last time, there was a request for a K8s issue, and now I'm waiting for
SPARK-27900.
Please let me know if there are any other issues.

Thanks,
Dongjoon.


[K8S] properties file via SPARK_CONF_DIR and --properties-file prevents definition of own properties via secrets/own mounts

2019-08-13 Thread Roland Johann
Hi all,

The K8S resource manager dumps the config map to 
/opt/spark/conf/spark-defaults.conf and passes it to spark-submit twice:
• via the env var SPARK_CONF_DIR=/opt/spark/conf/
• via the argument --properties-file /opt/spark/conf/spark-defaults.conf
This prevents the definition of user-defined properties, for example from 
secrets - currently, mounting a properties file seems to be the only way to 
define Spark config properties from k8s secrets.

Is the current implementation intended? If yes, what are the reasons behind 
that decision?
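
For readers less familiar with spark-submit's property resolution, the precedence can be modeled roughly as follows. This is a simplified toy model under my own assumptions (the real logic lives in SparkSubmit); it shows why a second, user-mounted properties file never takes effect:

```python
# Toy model of spark-submit property resolution (assumption: simplified;
# the real logic lives in SparkSubmit / loadDefaultSparkProperties).
def resolve(explicit_conf, properties_file):
    # Explicit --conf entries override --properties-file defaults.
    merged = dict(properties_file)
    merged.update(explicit_conf)
    return merged

# The K8s-generated file is always the one passed via --properties-file,
# so a user-mounted file (e.g. from a secret) never enters the merge.
k8s_generated_file = {"spark.executor.instances": "2"}
user_secret_file = {"spark.hadoop.mySecret": "s3kr3t"}  # effectively ignored
cli_conf = {"spark.app.name": "demo"}

effective = resolve(cli_conf, k8s_generated_file)
assert effective["spark.executor.instances"] == "2"
assert effective["spark.app.name"] == "demo"
assert "spark.hadoop.mySecret" not in effective
```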

Thanks and kind regards

Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobil: +49 172 365 26 46
Mail: roland.joh...@phenetic.io
Web: phenetic.io

Handelsregister: Amtsgericht Köln (HRB 92595)
Geschäftsführer: Roland Johann, Uwe Reimann





Re: [DISCUSS] ViewCatalog interface for DSv2

2019-08-13 Thread John Zhuge
Thanks for the feedback, Ryan! I can share the WIP copy of the SPIP if that
makes sense.

I couldn't find much about view resolution and validation in SQL Spec
Part 1. Anybody with deeper SQL knowledge, please chime in.

Here is my understanding based on online manuals, docs, and other
resources:

   - A view has a name in the database schema so that other queries can use
   it like a table.
   - A view's schema is frozen at the time the view is created; subsequent
   changes to underlying tables (e.g. adding a column) will not be reflected
   in the view's schema. If an underlying table is dropped or changed in an
   incompatible fashion, subsequent attempts to query the invalid view will
   fail.

In Presto, view columns are used for validation only (see
StatementAnalyzer.Visitor#isViewStale):

   - view column names must match the visible fields of analyzed view sql
   - the visible fields can be coerced to view column types

In Spark 2.2+, view columns are also used for validation (see
CheckAnalysis#checkAnalysis case View):

   - view column names must match the output fields of the view sql
   - view column types must be able to UpCast to output field types

The rule EliminateView adds a Project based on viewQueryColumnNames if it exists.
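
The validation described above can be sketched with a toy model (my own simplification; neither Presto's nor Spark's actual code):

```python
# Toy model of view-staleness validation (assumption: simplified; see
# Presto's StatementAnalyzer and Spark's CheckAnalysis for the real logic).
# A tiny widening relation stands in for Spark's UpCast / Presto coercion.
WIDENS_TO = {
    "int": {"int", "long", "double"},
    "long": {"long", "double"},
    "double": {"double"},
}

def is_view_valid(view_columns, query_output):
    """view_columns / query_output: lists of (name, type) tuples."""
    if len(view_columns) != len(query_output):
        return False
    return all(
        v_name == q_name and v_type in WIDENS_TO[q_type]
        for (v_name, v_type), (q_name, q_type) in zip(view_columns, query_output)
    )

view = [("id", "long"), ("score", "double")]
query = [("id", "int"), ("score", "double")]
assert is_view_valid(view, query)            # int widens to long: still valid
assert not is_view_valid(view, query[::-1])  # names no longer line up
```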

As for `softwareVersion`, the purpose is to track which software version was
used to create the view, in preparation for different versions of the same
software, or even different software such as Presto vs. Spark.


On Tue, Aug 13, 2019 at 9:47 AM Ryan Blue  wrote:

> Thanks for working on this, John!
>
> I'd like to see a more complete write-up of what you're proposing. Without
> that, I don't think we can have a productive discussion about this.
>
> For example, I think you're proposing to keep the view columns to ensure
> that the same columns are produced by the view every time, based on
> requirements from the SQL spec. Let's start by stating what those behavior
> requirements are, so that everyone has the context to understand why your
> proposal includes the view columns. Similarly, I'd like to know why you're
> proposing `softwareVersion` in the view definition.
>
> On Tue, Aug 13, 2019 at 8:56 AM John Zhuge  wrote:
>
>> Catalog support has been added to DSv2 along with a table catalog
>> interface. Here I'd like to propose a view catalog interface, for the
>> following benefit:
>>
>>- Abstraction for view management thus allowing different view
>>backends
>>- Disassociation of view definition storage from Hive Metastore
>>
>> A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an
>> identifier as view first then table.
>>
>> More details in SPIP and PR if we decide to proceed. Here is a quick
>> glance at the API:
>>
>> ViewCatalog interface:
>>
>>- loadView
>>- listViews
>>- createView
>>- deleteView
>>
>> View interface:
>>
>>- name
>>- originalSql
>>- defaultCatalog
>>- defaultNamespace
>>- viewColumns
>>- owner
>>- createTime
>>- softwareVersion
>>- options (map)
>>
>> ViewColumn interface:
>>
>>- name
>>- type
>>
>>
>> Thanks,
>> John Zhuge
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
John Zhuge


Re: [DISCUSS] ViewCatalog interface for DSv2

2019-08-13 Thread Ryan Blue
Thanks for working on this, John!

I'd like to see a more complete write-up of what you're proposing. Without
that, I don't think we can have a productive discussion about this.

For example, I think you're proposing to keep the view columns to ensure
that the same columns are produced by the view every time, based on
requirements from the SQL spec. Let's start by stating what those behavior
requirements are, so that everyone has the context to understand why your
proposal includes the view columns. Similarly, I'd like to know why you're
proposing `softwareVersion` in the view definition.

On Tue, Aug 13, 2019 at 8:56 AM John Zhuge  wrote:

> Catalog support has been added to DSv2 along with a table catalog
> interface. Here I'd like to propose a view catalog interface, for the
> following benefit:
>
>- Abstraction for view management thus allowing different view backends
>- Disassociation of view definition storage from Hive Metastore
>
> A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an
> identifier as view first then table.
>
> More details in SPIP and PR if we decide to proceed. Here is a quick
> glance at the API:
>
> ViewCatalog interface:
>
>- loadView
>- listViews
>- createView
>- deleteView
>
> View interface:
>
>- name
>- originalSql
>- defaultCatalog
>- defaultNamespace
>- viewColumns
>- owner
>- createTime
>- softwareVersion
>- options (map)
>
> ViewColumn interface:
>
>- name
>- type
>
>
> Thanks,
> John Zhuge
>


-- 
Ryan Blue
Software Engineer
Netflix


[DISCUSS] ViewCatalog interface for DSv2

2019-08-13 Thread John Zhuge
Catalog support has been added to DSv2 along with a table catalog
interface. Here I'd like to propose a view catalog interface, for the
following benefits:

   - Abstraction for view management, thus allowing different view backends
   - Disassociation of view definition storage from Hive Metastore

A catalog plugin can be both a TableCatalog and a ViewCatalog. An
identifier is resolved as a view first, then as a table.

More details in SPIP and PR if we decide to proceed. Here is a quick glance
at the API:

ViewCatalog interface:

   - loadView
   - listViews
   - createView
   - deleteView

View interface:

   - name
   - originalSql
   - defaultCatalog
   - defaultNamespace
   - viewColumns
   - owner
   - createTime
   - softwareVersion
   - options (map)

ViewColumn interface:

   - name
   - type
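
To make the shape of the proposal concrete, here is a minimal in-memory sketch of the listed methods; the names and signatures are my assumptions until the SPIP is shared:

```python
# Sketch of the proposed interfaces with an in-memory backend (assumption:
# illustrative names/signatures only, pending the SPIP).
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class View:
    name: str
    original_sql: str
    default_catalog: str
    default_namespace: List[str]
    view_columns: List[Tuple[str, str]]  # (name, type) pairs
    owner: str
    create_time: int
    software_version: str
    options: Dict[str, str] = field(default_factory=dict)

class InMemoryViewCatalog:
    """Trivial ViewCatalog standing in for e.g. a non-Hive view store."""
    def __init__(self):
        self._views: Dict[str, View] = {}
    def load_view(self, name: str) -> Optional[View]:
        return self._views.get(name)
    def list_views(self) -> List[str]:
        return list(self._views)
    def create_view(self, view: View) -> None:
        self._views[view.name] = view
    def delete_view(self, name: str) -> bool:
        return self._views.pop(name, None) is not None

catalog = InMemoryViewCatalog()
catalog.create_view(View("v1", "SELECT id FROM t", "spark_catalog", ["db"],
                         [("id", "long")], "john", 0, "spark-3.0"))
assert catalog.load_view("v1").original_sql == "SELECT id FROM t"
assert catalog.delete_view("v1") and catalog.load_view("v1") is None
```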


Thanks,
John Zhuge


Re: displaying "Test build" in PR

2019-08-13 Thread Wenchen Fan
"Can one of the admins verify this patch?" is a correct message, as
Jenkins won't test your PR until an admin approves it.

BTW I think "5 minutes" is a reasonable delay for PR testing. It usually
takes days to review and merge a PR, so I don't think seeing test progress
right after PR creation really matters.

On Tue, Aug 13, 2019 at 8:58 PM Younggyu Chun 
wrote:

> Thank you for your email.
>
> I think a newb like me might want to see what's going on PR and see
> something useful. For example, "Request builder polls every 5 minutes and
> you will see the progress here in a few minutes".  I guess we can add a
> more useful message on AmplabJenkins '
> message instead of a simple message like "Can one of the admins verify
> this patch?"
>
> Younggyu
>
> On Mon, Aug 12, 2019 at 3:55 PM Shane Knapp  wrote:
>
>> when you create a PR, the jenkins pull request builder job polls every ~5
>> or so minutes and will trigger jobs based on creation/approval to test/code
>> updates/etc.
>>
>> On Mon, Aug 12, 2019 at 11:25 AM Younggyu Chun 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have a quick question about PR. Once I create a PR I'm not able to see
>>> if "Test build" is being processed. But I can see this after a few minutes
>>> or hours later. Is it possible to see if "Test Build" is being processed
>>> after PR is created right away?
>>>
>>> Thank you,
>>> Younggyu Chun
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: My curation of pending structured streaming PRs to review

2019-08-13 Thread Sean Owen
General tips:

- dev@ is not usually the right place to discuss _specific_ changes
except once in a while to call attention
- Ping the authors of the code being changed directly
- Tighten the change if possible
- Tests, reproductions, docs, etc help prove the change
- Bugs are more important than new marginal features

If there has been some feedback that's just skeptical about the
approach or value, that may be the answer, it won't be merged.
If there is no feedback and it seems important (correctness bugs) it's
OK to raise that here once in a while.

One common theme here is 'structured streaming' -- who amongst the
committers feels they are able to review these changes? I sense we
have a shortage there.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: displaying "Test build" in PR

2019-08-13 Thread Younggyu Chun
Thank you for your email.

I think a newb like me might want to see what's going on with a PR and see
something useful. For example, "Request builder polls every 5 minutes and
you will see the progress here in a few minutes". I guess we can add a
more useful message from AmplabJenkins instead of a simple message like
"Can one of the admins verify this patch?"

Younggyu

On Mon, Aug 12, 2019 at 3:55 PM Shane Knapp  wrote:

> when you create a PR, the jenkins pull request builder job polls every ~5
> or so minutes and will trigger jobs based on creation/approval to test/code
> updates/etc.
>
> On Mon, Aug 12, 2019 at 11:25 AM Younggyu Chun 
> wrote:
>
>> Hi All,
>>
>> I have a quick question about PR. Once I create a PR I'm not able to see
>> if "Test build" is being processed. But I can see this after a few minutes
>> or hours later. Is it possible to see if "Test Build" is being processed
>> after PR is created right away?
>>
>> Thank you,
>> Younggyu Chun
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: My curation of pending structured streaming PRs to review

2019-08-13 Thread vikram agrawal
Thanks, Jungtaek for curating this list. It covers a lot of important fixes
and performance improvements in structured streaming.

Hi Devs

What is missing, from a process perspective, to get these PRs merged?
Apart from this list, is there any other forum where we can request
attention to such important PRs? Is the lack of reviews limited to
Structured Streaming, or are there other areas of Spark which suffer
from similar neglect? Does the community feel that we need a better
turnaround for PRs, to make sure that we don't miss out on important
contributions and encourage newbies like me?

Thanks
Vikram

On Tue, Jul 16, 2019 at 10:41 AM Jungtaek Lim  wrote:

> Hi devs,
>
> As we make progress on some minor PRs on structured streaming, I'd like to
> remind about major PRs on SS area to get more chances to be reviewed.
>
> Please note that I only include existing PRs, so something still not
> discussed like queryable state is not included in the curation list. Also,
> I've excluded PRs on continuous processing, as I'm not fully sure about
> current direction and vision on this feature. Minor PRs are mostly excluded
> unless they are proposed for a long ago. Last, I could be biased on
> curating list.
>
> Let's get started!
>
> 
> A. File Source/Sink
>
> 1. [SPARK-20568][SS] Provide option to clean up completed files in
> streaming query
>
> ISSUE: https://issues.apache.org/jira/browse/SPARK-20568
> PR: https://github.com/apache/spark/pull/22952
>
> From the nature of "stream", the input data will grow infinitely and end
> users want to have a clear way to clean up completed files. Unlike batch
> query, structured streaming doesn't require all input files to be presented
> - once they've been committed (say, completed processing), they wouldn't be
> read from such query.
>
> This patch automatically cleans up input files when they're committed,
> with three options: 1) keep it as it is, 2) archive (move) to other
> directory 3) delete.
>
> 2. [SPARK-27188][SS] FileStreamSink: provide a new option to have
> retention on output files
>
> ISSUE: https://issues.apache.org/jira/browse/SPARK-27188
> PR: https://github.com/apache/spark/pull/24128
>
> File sink writes metadata which records list of output files to ensure
> file source to only read correct files, which helps to achieve end-to-end
> exactly once. But file sink has no idea when output files will not be
> accessed from downstream query, so metadata just grows infinitely and
> output files cannot be removed safely.
>
> This patch opens the chance for end users to provide TTL on output files
> so that metadata will eventually exclude expired output files as well as
> end users could remove the output files safely.
>
>
> B. Kafka Source/Sink
>
> 1. [SPARK-21869][SS] A cached Kafka producer should not be closed if any
> task is using it - adds inuse tracking.
>
> ISSUE: https://issues.apache.org/jira/browse/SPARK-21869
> PR: https://github.com/apache/spark/pull/19096
>
> This is a long-lasting bug (around 2 years after filing the JIRA issue):
> if some task uses cached Kafka producer longer than 10 minutes, pool will
> recognize it as "timed-out" and just close it. After closing undefined
> behavior from task side will occur.
>
> This patch adds "in-use" tracking on producer to address this. Please note
> that Kafka producer is thread-safe (whereas Kafka consumer is not) and we
> allow using it concurrently, so we can't adopt commons pool to pool
> producer. (Though we can still leverage commons pool if we are OK to not
> share between threads.)
>
> 2. [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
>
> ISSUE: https://issues.apache.org/jira/browse/SPARK-23539
> PR: https://github.com/apache/spark/pull/22282
>
> As there's great doc to rationalize the needs on supporting Kafka headers,
> I'll just let the doc explaining it.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers
>
> Please note that the issue has been commented from end users regarding
> availability, which also represents the needs on end users' side.
>
> 3. [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
>
> ISSUE: https://issues.apache.org/jira/browse/SPARK-25151
> PR: https://github.com/apache/spark/pull/22138
>
> Kafka source has its pooling logic for consumers, but as I saw some JIRA
> issues regarding pooling we seem to agree we would like to replace with
> known pool implementation which provides advanced configuration, detailed
> metrics, etc.
>
> This patch adopts Apache Commons Pool (which above advantages are brought)
> to be used as a connection pool for consumers, with respecting to current
> behavior whenever possible. It also separates pooling for consumer and
> fetched data which enables to maximize efficiency on pooling consumers, and
> also address the bug on unnecessary re-fetch on self-join. (The result of
> experiment is in PR's content.)
>
> 4. [SPARK-26848][SQL] Introduce new option to Kafka 

Re: Ask for ARM CI for spark

2019-08-13 Thread Tianhua huang
Hi all,

About the ARM testing of Spark: recently we found two tests failing after
commit https://github.com/apache/spark/pull/23767:
   ReplayListenerSuite:
   - ...
   - End-to-end replay *** FAILED ***
 "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
   - End-to-end replay with compression *** FAILED ***
 "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)

We tried to revert the commit and then the tests passed. The patch is quite
big and unfortunately we haven't found the root cause yet. If you are
interested, please take a look; it would be much appreciated if someone
could help us figure it out.

On Tue, Aug 6, 2019 at 9:08 AM bo zhaobo 
wrote:

> Hi shane,
> Thanks for your reply. I will wait for you back. ;-)
>
> Thanks,
> Best regards
> ZhaoBo
>
>
>
>
> On Fri, Aug 2, 2019 at 10:41 PM, shane knapp wrote:
>
>> i'm out of town, but will answer some of your questions next week.
>>
>> On Fri, Aug 2, 2019 at 2:39 AM bo zhaobo 
>> wrote:
>>
>>>
>>> Hi Team,
>>>
>>> Any updates about the CI details? ;-)
>>>
>>> Also, I will also need your kind help about Spark QA test, could any one
>>> can tell us how to trigger that tests? When? How?  So far, I haven't
>>> notices how it works.
>>>
>>> Thanks
>>>
>>> Best Regards,
>>>
>>> ZhaoBo
>>>
>>>
>>>
>>>
>>> On Wed, Jul 31, 2019 at 11:56 AM, bo zhaobo wrote:
>>>
 Hi, team.
 I want to make the same test on ARM like existing CI does(x86). As
 building and testing the whole spark projects will cost too long time, so I
 plan to split them to multiple jobs to run for lower time cost. But I
 cannot see what the existing CI[1] have done(so many private scripts
 called), so could any CI maintainers help/tell us for how to split them and
 the details about different CI jobs does? Such as PR title contains [SQL],
 [INFRA], [ML], [DOC], [CORE], [PYTHON], [k8s], [DSTREAMS], [MLlib],
 [SCHEDULER], [SS],[YARN], [BUIILD] and etc..I found each of them seems run
 the different CI job.

 @shane knapp,
 Oh, sorry for disturb. I found your email looks like from 'berkeley.edu',
 are you the good guy who we are looking for help about this? ;-)
 If so, could you give some helps or advices? Thank you.

 Thank you very much,

 Best Regards,

 ZhaoBo

 [1] https://amplab.cs.berkeley.edu/jenkins





 On Mon, Jul 29, 2019 at 9:38 AM, Tianhua huang wrote:

> @Sean Owen   Thank you very much. And I saw your
> reply comment in https://issues.apache.org/jira/browse/SPARK-28519, I
> will test with modification and to see whether there are other similar
> tests fail, and will address them together in one pull request.
>
> On Sat, Jul 27, 2019 at 9:04 PM Sean Owen  wrote:
>
>> Great thanks - we can take this to JIRAs now.
>> I think it's worth changing the implementation of atanh if the test
>> value just reflects what Spark does, and there's evidence is a little bit
>> inaccurate.
>> There's an equivalent formula which seems to have better accuracy.
>>
>> On Fri, Jul 26, 2019 at 10:02 PM Takeshi Yamamuro <
>> linguin@gmail.com> wrote:
>>
>>> Hi, all,
>>>
>>> FYI:
>>> >> @Yuming Wang the results in float8.sql are from PostgreSQL
>>> directly?
>>> >> Interesting if it also returns the same less accurate result,
>>> which
>>> >> might suggest it's more to do with underlying OS math libraries.
>>> You
>>> >> noted that these tests sometimes gave platform-dependent
>>> differences
>>> >> in the last digit, so wondering if the test value directly
>>> reflects
>>> >> PostgreSQL or just what we happen to return now.
>>>
>>> The results in float8.sql.out were recomputed in Spark/JVM.
>>> The expected output of the PostgreSQL test is here:
>>> https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L493
>>>
>>> As you can see in the file (float8.out), the results other than atanh
>>> also are different between Spark/JVM and PostgreSQL.
>>> For example, the answers of acosh are:
>>> --
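
Picking up Sean's point above about an equivalent atanh formula with better accuracy: the two forms can be compared directly. This is a sketch of the kind of rewrite discussed (the exact change is tracked in SPARK-28519, so this is not necessarily the patch), using `log1p` to avoid cancellation:

```python
# Two mathematically equivalent atanh formulas (assumption: this is the
# kind of rewrite discussed around SPARK-28519, not necessarily the exact
# patch). log1p avoids catastrophic cancellation for small |x|.
import math

def atanh_naive(x):
    return 0.5 * math.log((1.0 + x) / (1.0 - x))

def atanh_stable(x):
    return 0.5 * (math.log1p(x) - math.log1p(-x))

# Both agree with the reference value at a "nice" argument...
assert abs(atanh_stable(0.5) - 0.5493061443340549) < 1e-15
assert abs(atanh_naive(0.5) - atanh_stable(0.5)) < 1e-15
# ...and the log1p form stays accurate for tiny arguments, where
# atanh(x) is x to within double precision.
x = 1e-8
assert abs(atanh_stable(x) - x) < 1e-22
```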