[VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-09-04 Thread Thomas Graves
Hey everyone,

I'd like to call for a vote on SPARK-27495 SPIP: Support Stage level
resource configuration and scheduling

This is for supporting stage level resource configuration and
scheduling.  The basic idea is to allow the user to specify executor
and task resource requirements for each stage, so they can control the
resources required at a finer grain. One good example here
is doing some ETL to preprocess your data in one stage and then feeding
that data into an ML algorithm (like TensorFlow) that runs as a
separate stage.  The ETL stage could need totally different resource
requirements for the executors/tasks than the ML stage does.
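
To make the idea concrete, here is a rough, purely illustrative sketch of what
per-stage resource requests could look like. All of the names below
(StageResourceRequest, withStageResources, the amounts) are hypothetical
placeholders, not the API proposed in the design doc:

// Hypothetical sketch only: StageResourceRequest and withStageResources are
// made-up names used to illustrate the concept.
val etl = spark.read.parquet("raw/")      // ETL / preprocessing stage: CPUs only
  .repartition(200)

// Describe what the ML stage needs per executor and per task.
val mlResources = new StageResourceRequest()
  .executorCores(4)
  .executorMemory("16g")
  .executorResource("gpu", amount = 1)
  .taskResource("gpu", amount = 1)

// Attach the requirements to the data feeding the ML stage, so the scheduler
// can acquire differently sized executors for just that stage.
val mlInput = etl.rdd.withStageResources(mlResources)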

The text for the SPIP is in the jira description:

https://issues.apache.org/jira/browse/SPARK-27495

I split the API and Design parts into a google doc that is linked to
from the jira.

This vote is open until next Fri (Sept 13th).

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

I'll start with my +1

Thanks,
Tom




Re: Why two netty libs?

2019-09-04 Thread Sean Owen
Yes, that's right. I don't think Spark's usage of ZK needs any ZK
server, so it's safe to exclude Netty 3 in Spark (at least, so far so good!)
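
For anyone chasing the same thing in their own build: Netty 3 ships as the
io.netty:netty artifact under the org.jboss.netty package, while Netty 4 is
io.netty:netty-all under io.netty, which is why the two can coexist on one
classpath. A minimal sbt sketch, assuming a direct ZooKeeper dependency with an
illustrative version:

// build.sbt sketch: drop only the Netty 3 artifact that ZooKeeper pulls in.
libraryDependencies += ("org.apache.zookeeper" % "zookeeper" % "3.4.14")
  .exclude("io.netty", "netty")   // Netty 3.x uses the artifact id "netty"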

On Wed, Sep 4, 2019 at 8:06 AM Steve Loughran
 wrote:
>
> Zookeeper client is/was netty 3, AFAIK, so if you want to use it for 
> anything, it ends up on the CP
>
> On Tue, Sep 3, 2019 at 5:18 PM Shixiong(Ryan) Zhu  
> wrote:
>>
>> Yep, historical reasons. And Netty 4 is under another namespace, so we can 
>> use Netty 3 and Netty 4 in the same JVM.
>>
>> On Tue, Sep 3, 2019 at 6:15 AM Sean Owen  wrote:
>>>
>>> It was for historical reasons; some other transitive dependencies needed it.
>>> I actually was just able to exclude Netty 3 last week from master.
>>> Spark uses Netty 4.
>>>
>>> On Tue, Sep 3, 2019 at 6:59 AM Jacek Laskowski  wrote:
>>> >
>>> > Hi,
>>> >
>>> > Just noticed that Spark 2.4.x uses two netty deps of different versions. 
>>> > Why?
>>> >
>>> > jars/netty-all-4.1.17.Final.jar
>>> > jars/netty-3.9.9.Final.jar
>>> >
>>> > Shouldn't one be excluded or perhaps shaded?
>>> >
>>> > Pozdrawiam,
>>> > Jacek Laskowski
>>> > 
>>> > https://about.me/JacekLaskowski
>>> > The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>> > The Internals of Spark Structured Streaming 
>>> > https://bit.ly/spark-structured-streaming
>>> > The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>> > Follow me at https://twitter.com/jaceklaskowski
>>> >
>>>
>>>
>> --
>>
>> Best Regards,
>>
>> Ryan




Re: Why two netty libs?

2019-09-04 Thread Steve Loughran
Zookeeper client is/was netty 3, AFAIK, so if you want to use it for
anything, it ends up on the CP

On Tue, Sep 3, 2019 at 5:18 PM Shixiong(Ryan) Zhu 
wrote:

> Yep, historical reasons. And Netty 4 is under another namespace, so we can
> use Netty 3 and Netty 4 in the same JVM.
>
> On Tue, Sep 3, 2019 at 6:15 AM Sean Owen  wrote:
>
>> It was for historical reasons; some other transitive dependencies needed
>> it.
>> I actually was just able to exclude Netty 3 last week from master.
>> Spark uses Netty 4.
>>
>> On Tue, Sep 3, 2019 at 6:59 AM Jacek Laskowski  wrote:
>> >
>> > Hi,
>> >
>> > Just noticed that Spark 2.4.x uses two netty deps of different
>> versions. Why?
>> >
>> > jars/netty-all-4.1.17.Final.jar
>> > jars/netty-3.9.9.Final.jar
>> >
>> > Shouldn't one be excluded or perhaps shaded?
>> >
>> > Pozdrawiam,
>> > Jacek Laskowski
>> > 
>> > https://about.me/JacekLaskowski
>> > The Internals of Spark SQL https://bit.ly/spark-sql-internals
>> > The Internals of Spark Structured Streaming
>> https://bit.ly/spark-structured-streaming
>> > The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>> > Follow me at https://twitter.com/jaceklaskowski
>> >
>>
>>
>> --
>
> Best Regards,
> Ryan
>


Re: Schema inference for nested case class issue

2019-09-04 Thread Sean Owen
user@ is the right place for these types of questions.
As the error says, your case class defines a schema that includes columns
like 'fix', but those columns don't appear in your DataFrame. The
DataFrame's schema needs to match the case class.
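
A minimal sketch of one way to satisfy the encoder, assuming the Parquet file
has flat columns corresponding to the case class fields (the flat column names
"a", "b", "c" below are assumptions based on the case classes in the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

case class Fix(a: String, b: String)
case class Acc(fix: Fix, c: String)

val spark = SparkSession.builder().appName("nested-case-class").getOrCreate()
import spark.implicits._

// The encoder for Acc expects a top-level struct column literally named "fix"
// (with fields "a" and "b") plus a column "c". If the Parquet data is flat,
// build that struct explicitly before calling .as[Acc].
val ds = spark.read.parquet("path")
  .select(struct(col("a"), col("b")).as("fix"), col("c"))
  .as[Acc]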

On Wed, Sep 4, 2019 at 6:44 AM El Houssain ALLAMI
 wrote:
>
> Hi ,
>
> i have nested case class that i wanted to fill with data read  from parquet 
> but i got this error : cannot resolve '`fix`' given input columns: 
> [sizeinternalname, sap_compo08_type ]
>
> case class Fix(a:String,b:String)
> case class Acc(fix:Fix, c:String)
> spark.read.parquet("path").as[Acc]
>
> it does not look inside nested case class
>
> any solution for this issue ?
>
> Many Thanks,
> ELhoussain




Schema inference for nested case class issue

2019-09-04 Thread El Houssain ALLAMI
Hi ,

I have a nested case class that I want to fill with data read from Parquet,
but I get this error: cannot resolve '`fix`' given input columns:
[sizeinternalname, sap_compo08_type]

case class Fix(a:String,b:String)
case class Acc(fix:Fix, c:String)
spark.read.parquet("path").as[Acc]

It does not seem to look inside the nested case class.

Any solution for this issue?

Many Thanks,
ELhoussain


Re: [SS]Kafka EOS transaction timeout solution

2019-09-04 Thread wenxuan Guan
Thanks for your reply.
1. About submitting tasks to commit the Kafka transaction
Committing a Kafka producer transaction is a lightweight action; compared with
sending messages to Kafka, there is little chance of the commit itself failing.
However, to implement EOS we must still handle commit failures, and the case to
focus on is the transaction commit timeout.
2. About the transaction commit timeout
I agree with you about notifying the user. In addition to setting the
`transaction.timeout.ms` configuration and documenting it, maybe we can do
something to resend the data to Kafka and fall back to at-least-once when a
transaction timeout is detected, right?
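
For reference, a minimal sketch of raising the client-side timeout on a plain
Kafka transactional producer (these are standard Kafka producer configs; how
the proposed Spark sink would forward them is an assumption, not an existing
Spark option):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")         // placeholder address
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "query-1-partition-0")  // illustrative id
// Raise the client-side timeout toward the broker's transaction.max.timeout.ms
// (15 minutes by default) so a restarted job can still resume and commit
// before the broker aborts the pending transaction.
props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, (15 * 60 * 1000).toString)
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()  // fails fast if the broker rejects the timeout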

On Tue, Sep 3, 2019 at 11:39 AM Genmao Yu wrote:

> Increasing `transaction.timeout.ms` on both the Kafka client and server side
> may not be the best solution, but it is a workable one. In the design, Spark will
> submit tasks to all executors to do the Kafka commit (the second phase of the
> 2PC). This will increase the possibility of commit failure. Besides, users should
> be made aware that if a Kafka transaction commit times out, exactly-once may fall
> back to at-least-once.
>
> On Sat, Aug 31, 2019 at 3:42 PM wenxuan Guan wrote:
>
>> Hi all,
>>
>> I have implemented Structured Streaming Kafka sink EOS with a Kafka
>> transactional producer; the design sketch is in SPARK-28908
>> (https://issues.apache.org/jira/browse/SPARK-28908) and the PR is
>> https://github.com/apache/spark/pull/25618.
>> But now I meet a problem as blow.
>>
>> When the producer fails to commit a transaction after successfully sending
>> data to Kafka, for some reason such as the Kafka broker going down, the Spark
>> job will fail. After the Kafka broker recovers, the job can be restarted and the
>> transaction will resume. But if the time between the commit failure being fixed
>> and the job restart (by a job attempt or manually) exceeds the config
>> `transaction.timeout.ms`, the data sent by the producer will be discarded by
>> the Kafka broker, leading to data loss.
>>
>> My solution is:
>> 1. Increase the config `transaction.timeout.ms`.
>> Raise it from 60 seconds, the producer's default value for
>> `transaction.timeout.ms`, to 15 minutes, the Kafka broker's default value for
>> `transaction.max.timeout.ms`, if the user has not defined it, because the
>> request will fail with an InvalidTransactionTimeout error if
>> `transaction.timeout.ms` is larger than `transaction.max.timeout.ms`. If the
>> user has defined `transaction.timeout.ms`, we just check that it is large
>> enough.
>> 2. Document the config `transaction.timeout.ms`, and introduce some ways to
>> avoid data loss, such as increasing `transaction.timeout.ms` and
>> `transaction.max.timeout.ms`, and avoiding exceeding the timeout.
>>
>> BTW, I just skimmed the code in Flink and found that by default Flink sets the
>> `transaction.timeout.ms` property in the producer config to 1 hour, and its
>> docs tell the user to increase `transaction.max.timeout.ms`.
>>
>> Any idea about how to handle this problem?
>>
>> Many Thanks,
>> Wenxuan
>>
>


[VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-04 Thread Gengliang Wang
Hi everyone,

I'd like to call for a vote on SPARK-28885
(https://issues.apache.org/jira/browse/SPARK-28885), "Follow ANSI store
assignment rules in table insertion by default".
When inserting a value into a column with a different data type,
Spark performs type coercion. Currently, we support 3 policies for the
type coercion rules: ANSI, legacy and strict, which can be set via the
option "spark.sql.storeAssignmentPolicy":
1. ANSI: Spark performs the type coercion as per ANSI SQL. In
practice, the behavior is mostly the same as PostgreSQL. It disallows
certain unreasonable type conversions such as converting `string` to
`int` and `double` to `boolean`.
2. Legacy: Spark allows the type coercion as long as it is a valid
`Cast`, which is very loose. E.g., converting either `string` to `int`
or `double` to `boolean` is allowed. It is the current behavior in
Spark 2.x for compatibility with Hive.
3. Strict: Spark doesn't allow any possible precision loss or data
truncation in type coercion, e.g., converting either `double` to `int`
or `decimal` to `double` is not allowed. The rules were originally for
the Dataset encoder. As far as I know, no mainstream DBMS uses
this policy by default.

Currently, the V1 data source uses "Legacy" policy by default, while
V2 uses "Strict". This proposal is to use "ANSI" policy by default for
both V1 and V2 in Spark 3.0.
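
To make the difference concrete, here is a small sketch using the option named
above (the behavior noted in the comments just restates the policy definitions
in this email; exact error types and messages aren't claimed):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("store-assignment").getOrCreate()
spark.sql("CREATE TABLE t (i INT) USING parquet")

// Legacy: any valid Cast is accepted, so a string value is silently cast on
// insert (possibly ending up as null).
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("INSERT INTO t VALUES ('not a number')")

// ANSI: string -> int is one of the unreasonable conversions that ANSI store
// assignment rejects, so the same insert fails instead of writing null.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
// spark.sql("INSERT INTO t VALUES ('not a number')")   // rejected under ANSI

// Strict: even a potentially lossy double -> int assignment is rejected.
spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")
// spark.sql("INSERT INTO t SELECT 1.5D")               // rejected under STRICT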

There was also a DISCUSS thread "Follow ANSI SQL on table insertion"
in the dev mailing list.

This vote is open until next Thurs (Sept. 12th).

[ ] +1: Accept the proposal
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thank you!

Gengliang