My curation of pending structured streaming PRs to review

2019-07-15 Thread Jungtaek Lim
Hi devs,

As we make progress on some minor PRs in structured streaming, I'd like to
highlight the major PRs in the SS area so they get more chances to be reviewed.

Please note that I only include existing PRs, so features not yet under
discussion, like queryable state, are not in the list. I've also excluded
PRs on continuous processing, as I'm not fully sure about the current
direction and vision for that feature. Minor PRs are mostly excluded unless
they were proposed long ago. Finally, the list may reflect my own bias.

Let's get started!


A. File Source/Sink

1. [SPARK-20568][SS] Provide option to clean up completed files in
streaming query

ISSUE: https://issues.apache.org/jira/browse/SPARK-20568
PR: https://github.com/apache/spark/pull/22952

From the nature of a "stream", the input data grows infinitely, and end
users want a clear way to clean up completed files. Unlike a batch query,
structured streaming doesn't require all input files to remain present -
once they've been committed (that is, completely processed), they won't be
read by the query again.

This patch cleans up input files automatically once they're committed, with
three options: 1) keep them as they are, 2) archive (move) them to another
directory, or 3) delete them.
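
For illustration, here's a minimal sketch of how this surfaces on the read
side (the option names "cleanSource" and "sourceArchiveDir" follow the PR
and may change until it's merged):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clean-completed-files").getOrCreate()

// Archive each input file once the micro-batch that read it is committed.
val input = spark.readStream
  .format("csv")
  .schema("id INT, value STRING")              // file sources need an explicit schema
  .option("cleanSource", "archive")            // per the PR: "off" (default), "archive", or "delete"
  .option("sourceArchiveDir", "/data/archive") // target directory when archiving
  .load("/data/incoming")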

2. [SPARK-27188][SS] FileStreamSink: provide a new option to have retention
on output files

ISSUE: https://issues.apache.org/jira/browse/SPARK-27188
PR: https://github.com/apache/spark/pull/24128

The file sink writes metadata recording the list of output files, ensuring
the file source reads only the correct files; this helps achieve end-to-end
exactly-once semantics. But the file sink has no idea when its output files
will no longer be accessed by downstream queries, so the metadata grows
infinitely and output files can never be removed safely.

This patch gives end users the chance to set a TTL on output files, so that
the metadata eventually excludes expired output files and end users can
remove those files safely.
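
A sketch of the proposed usage; "retention" is the option name used in the
PR, so treat it as hypothetical until the change is merged:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-sink-retention").getOrCreate()

// A trivial rate stream stands in for real input data.
val df = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

val query = df.writeStream
  .format("parquet")
  .option("path", "/data/output")
  .option("checkpointLocation", "/data/checkpoint")
  .option("retention", "72h") // proposed TTL: expired files drop out of the sink metadata
  .start()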


B. Kafka Source/Sink

1. [SPARK-21869][SS] A cached Kafka producer should not be closed if any
task is using it - adds inuse tracking.

ISSUE: https://issues.apache.org/jira/browse/SPARK-21869
PR: https://github.com/apache/spark/pull/19096

This is a long-standing bug (the JIRA issue was filed around 2 years ago):
if a task uses a cached Kafka producer for longer than 10 minutes, the pool
considers it "timed out" and simply closes it. After the close, the task
hits undefined behavior.

This patch adds "in-use" tracking on the producer to address this. Please
note that the Kafka producer is thread-safe (whereas the Kafka consumer is
not) and we allow using it concurrently, so we can't adopt Commons Pool for
pooling producers. (We could still leverage Commons Pool if we were OK with
not sharing a producer between threads.)
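
The core idea, reduced to a sketch (the types and names below are
illustrative, not the PR's actual code): reference-count every borrower, and
let the eviction sweep close a producer only when it is both expired and idle.

import java.util.concurrent.atomic.AtomicInteger

// Stand-in for the real Kafka producer type; illustrative only.
trait ProducerLike { def close(): Unit }

final class CachedProducer(val producer: ProducerLike) {
  private val inUse = new AtomicInteger(0)
  @volatile private var lastReleasedNs: Long = System.nanoTime()

  def acquire(): ProducerLike = { inUse.incrementAndGet(); producer }

  def release(): Unit = { lastReleasedNs = System.nanoTime(); inUse.decrementAndGet() }

  // Called from the cache's eviction sweep. A real implementation must make
  // the check-and-close atomic with acquire(); this sketch omits that locking.
  def closeIfExpiredAndIdle(timeoutNs: Long): Boolean = {
    val expired = System.nanoTime() - lastReleasedNs > timeoutNs
    if (expired && inUse.get() == 0) { producer.close(); true } else false
  }
}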

2. [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

ISSUE: https://issues.apache.org/jira/browse/SPARK-23539
PR: https://github.com/apache/spark/pull/22282

There's a great doc laying out the rationale for supporting Kafka headers,
so I'll let it do the explaining:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers

Please note that end users have commented on the issue asking about
availability, which also shows the demand on the end users' side.
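
From the user's side it would look roughly like this (the "includeHeaders"
option and the headers column layout are taken from the PR and may change
before merge):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-headers").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("includeHeaders", "true") // option name per the PR
  .load()

// Per the PR, headers surface as an extra column:
//   headers: array<struct<key: string, value: binary>>
kafkaDf.select("key", "value", "headers").printSchema()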

3. [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer

ISSUE: https://issues.apache.org/jira/browse/SPARK-25151
PR: https://github.com/apache/spark/pull/22138

The Kafka source has its own pooling logic for consumers, but judging by
several JIRA issues about pooling, we seem to agree that we'd like to
replace it with a known pool implementation that provides advanced
configuration, detailed metrics, etc.

This patch adopts Apache Commons Pool (which brings the advantages above)
as the connection pool for consumers, respecting current behavior wherever
possible. It also separates the pooling of consumers from the pooling of
fetched data, which maximizes the efficiency of pooling consumers and also
fixes the bug of unnecessary re-fetching on self-joins. (Experiment results
are in the PR description.)
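
To show the shape of the Commons Pool API involved, here's a minimal
keyed-pool sketch; CacheKey and PooledConsumer are illustrative stand-ins,
not the PR's actual classes:

import org.apache.commons.pool2.{BaseKeyedPooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericKeyedObjectPool}

// Stand-ins: the real code pools Kafka consumers keyed by group/topic/partition.
final case class CacheKey(groupId: String, topic: String, partition: Int)
final class PooledConsumer(val key: CacheKey) { def close(): Unit = () }

class ConsumerObjectFactory extends BaseKeyedPooledObjectFactory[CacheKey, PooledConsumer] {
  override def create(key: CacheKey): PooledConsumer = new PooledConsumer(key)
  override def wrap(value: PooledConsumer): PooledObject[PooledConsumer] =
    new DefaultPooledObject(value)
  override def destroyObject(key: CacheKey, p: PooledObject[PooledConsumer]): Unit =
    p.getObject.close()
}

val pool = new GenericKeyedObjectPool(new ConsumerObjectFactory)
pool.setMaxTotalPerKey(1) // consumers aren't thread-safe; cap concurrent use per key

val key = CacheKey("group-1", "events", 0)
val consumer = pool.borrowObject(key) // reuses an idle consumer or creates one
try {
  // ... fetch records with the borrowed consumer ...
} finally {
  pool.returnObject(key, consumer) // return to the pool instead of closing
}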

4. [SPARK-26848][SQL] Introduce new option to Kafka source: offset by
timestamp (starting/ending)

ISSUE: https://issues.apache.org/jira/browse/SPARK-26848
PR: https://github.com/apache/spark/pull/23747

When end users want to replay records in a Kafka topic, they don't memorize
the exact offsets per partition, yet Spark requires exactly that -
otherwise the query just starts from the earliest offset. We humans are
much more comfortable with time: when we want to replay some records, we
know the timestamp of the records we should start from.

This patch gives end users the chance to specify offsets by timestamp
(starting, ending, or both), which will be transparently passed on to
Kafka when requesting offsets.
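
A sketch of the proposed usage; the option name and the per-partition
epoch-millisecond JSON format follow the PR and may change before merge:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("offsets-by-timestamp").getOrCreate()

// Start reading each partition of "events" from the first offset whose
// record timestamp is at or after the given epoch milliseconds.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("startingOffsetsByTimestamp",
    """{"events": {"0": 1563148800000, "1": 1563148800000}}""")
  .load()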


C. State

1. [SPARK-27237][SS] Introduce State schema validation among query restart

ISSUE: https://issues.apache.org/jira/browse/SPARK-27237
PR: 

RE: JDBC connector for DataSourceV2

2019-07-15 Thread Priyanka Gomatam
I would have thought one of the most important goals would be pushing down 
limits since V2 supports it.

I am also interested in collaborating. Thanks!

Priyanka Gomatam


Creating external Druid table

2019-07-15 Thread Valeriy Trofimov
Hi All,

How do you create an external Druid table via Spark?

I know that you can do it like this:
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-druid/content/druid_anatomy_of_hive_to_druid.html

But the issue is that Spark was built on Hive 1.2.1:
https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html

That version of Hive didn't support Druid ingestion, so running the
ingestion query gives me an error.

What is the best solution for this?

Thanks,
Val


Re: JDBC connector for DataSourceV2

2019-07-15 Thread Shiv Prashant Sood
Agree. Let's use SPARK-24907 as the JIRA for this work. Thanks for
resolving SPARK-28380 as a dupe of this.

Regards,
Shiv

On Mon, Jul 15, 2019 at 1:50 AM Gabor Somogyi 
wrote:

> I've had a look at the JIRAs and it seems like the intention is the same
> (correct me if I'm wrong).
> I think one is enough and the rest can be closed as duplicates.
> We should keep multiple JIRAs only when the intention is different.
>
> BR,
> G
>
>
> On Mon, Jul 15, 2019 at 6:01 AM Xianyin Xin 
> wrote:
>
>> There's another PR, https://github.com/apache/spark/pull/21861, but it is
>> based on the old V2 APIs.
>>
>>
>>
>> We'd better link the JIRAs (SPARK-24907, SPARK-25547, and SPARK-28380)
>> and finalize a plan.
>>
>>
>>
>> Xianyin
>>
>>
>>
>> *From: *Shiv Prashant Sood 
>> *Date: *Sunday, July 14, 2019 at 2:59 AM
>> *To: *Gabor Somogyi 
>> *Cc: *Xianyin Xin <xianyin@alibaba-inc.com>, Ryan Blue <rb...@netflix.com>,
>> <gengliang.w...@databricks.com>, Spark Dev List <dev@spark.apache.org>
>> *Subject: *Re: JDBC connector for DataSourceV2
>>
>>
>>
>> To me this looks like refactoring of DS1 JDBC to enable user provided
>> connection factories. In itself a good change, but IMO not DSV2 related.
>>
>>
>>
>> I created a JIRA and added some goals. Please comment/add as relevant.
>>
>>
>>
>> https://issues.apache.org/jira/browse/SPARK-28380
>>
>>
>>
>> JIRA for DataSourceV2 API based JDBC connector.
>>
>> Goals :
>>
>>    - Generic connector based on JDBC that supports all databases (the min
>>    bar is support for all V1 databases).
>>- Reference implementation and Interface for any specialized JDBC
>>connectors.
>>
>>
>>
>> Regards,
>>
>> Shiv
>>
>>
>>
>> On Sat, Jul 13, 2019 at 2:17 AM Gabor Somogyi 
>> wrote:
>>
>> Hi Guys,
>>
>>
>>
>> Don't know what's the intention exactly here but there is such a PR:
>> https://github.com/apache/spark/pull/22560
>>
>> If that's what we need maybe we can resurrect it. BTW, I'm also
>> interested in...
>>
>>
>>
>> BR,
>>
>> G
>>
>>
>>
>>
>>
>> On Sat, Jul 13, 2019 at 4:09 AM Shiv Prashant Sood <
>> shivprash...@gmail.com> wrote:
>>
>> Thanks all. I can also contribute toward this effort.
>>
>>
>>
>> Regards,
>>
>> Shiv
>>
>> Sent from my iPhone
>>
>>
>> On Jul 12, 2019, at 6:51 PM, Xianyin Xin 
>> wrote:
>>
>> If there’s nobody working on that, I’d like to contribute.
>>
>>
>>
>> Loop in @Gengliang Wang.
>>
>>
>>
>> Xianyin
>>
>>
>>
>> *From: *Ryan Blue 
>> *Reply-To: *
>> *Date: *Saturday, July 13, 2019 at 6:54 AM
>> *To: *Shiv Prashant Sood 
>> *Cc: *Spark Dev List 
>> *Subject: *Re: JDBC connector for DataSourceV2
>>
>>
>>
>> I'm not aware of a JDBC connector effort. It would be great to have
>> someone build one!
>>
>>
>>
>> On Fri, Jul 12, 2019 at 3:33 PM Shiv Prashant Sood <
>> shivprash...@gmail.com> wrote:
>>
>> Can someone please help me understand the current status of a DataSource
>> V2-based JDBC connector? I see connectors for various file formats in
>> master, but can't find a JDBC implementation or related JIRA.
>>
>>
>>
>> The DataSourceV2 APIs look to be in good shape to attempt a JDBC
>> connector for the READ/WRITE path.
>>
>> Thanks & Regards,
>>
>> Shiv
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
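
As a rough sketch of what such a connector entails, here is a read-path
skeleton. The DSv2 interfaces are still moving, so treat the packages and
signatures below as approximate (they reflect the shape the API is
converging on), and the Jdbc* class names as hypothetical:

import java.util

import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point Spark discovers for the format.
class JdbcTableProvider extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    ??? // e.g. read JDBC metadata for options.get("dbtable")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table =
    new JdbcTable(schema)
}

class JdbcTable(jdbcSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "jdbc-table"
  override def schema(): StructType = jdbcSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ) // add BATCH_WRITE for the write path
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    ??? // ScanBuilder -> Scan -> Batch -> PartitionReader chain issues the real SQL
}

Filter and column-pruning pushdown (and, per Priyanka's point, eventually
limit pushdown) would hang off the ScanBuilder via the SupportsPushDown*
mix-ins.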


Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-15 Thread Dongjoon Hyun
Hi, Apache Spark PMC members.

Can we cut Apache Spark 2.4.4 next Monday (22nd July)?

Bests,
Dongjoon.


On Fri, Jul 12, 2019 at 3:18 PM Dongjoon Hyun 
wrote:

> Thank you, Jacek.
>
> BTW, I added `@private` since we need PMC's help to make an Apache Spark
> release.
>
> Can I get more feedback from the other PMC members?
>
> Please let me know if you have any concerns (e.g. release date or release
> manager).
>
> As one of the community members, I assume the following (if we stay on
> schedule):
>
> - 2.4.4 at the end of July
> - 2.3.4 at the end of August (since 2.3.0 was released at the end of
> February 2018)
> - 3.0.0 (possibly September?)
> - 3.1.0 (January 2020?)
>
> Bests,
> Dongjoon.
>
>
> On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Thanks Dongjoon Hyun for stepping up as a release manager!
>> Much appreciated.
>>
>> If there's a volunteer to cut a release, I'm always happy to support it.
>>
>> In addition, the more frequent the releases, the better for end users:
>> they get the choice to upgrade for all the latest fixes or to wait. It's
>> their call, not ours (otherwise we'd be keeping them waiting).
>>
>> My big 2 yes'es for the release!
>>
>> Jacek
>>
>>
>> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun,  wrote:
>>
>>> Hi, All.
>>>
>>> Spark 2.4.3 was released two months ago (8th May).
>>>
>>> As of today (9th July), there are 45 fixes in `branch-2.4`, including
>>> the following correctness or blocker issues:
>>>
>>> - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
>>> decimals not fitting in long
>>> - SPARK-26045 Error in the spark 2.4 release package with the
>>> spark-avro_2.11 dependency
>>> - SPARK-27798 from_avro can modify variables in other rows in local
>>> mode
>>> - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
>>> - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
>>> entries
>>> - SPARK-28308 CalendarInterval sub-second part should be padded
>>> before parsing
>>>
>>> It would be great if we can get Spark 2.4.4 out before we all get busier
>>> with 3.0.0.
>>> If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll
>>> the release next Monday (15th July).
>>> What do you think?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>


Re: JDBC connector for DataSourceV2

2019-07-15 Thread Gabor Somogyi
I've had a look at the JIRAs and it seems like the intention is the same
(correct me if I'm wrong).
I think one is enough and the rest can be closed as duplicates.
We should keep multiple JIRAs only when the intention is different.

BR,
G

