Re: [DISCUSS] Spark 2.5 release

2019-09-22 Thread Marco Gaido
I agree with Matei too.

Thanks,
Marco

On Sun, Sep 22, 2019 at 03:44 Dongjoon Hyun <
dongjoon.h...@gmail.com> wrote:

> +1 for Matei's suggestion!
>
> Bests,
> Dongjoon.
>
> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia 
> wrote:
>
>> If the goal is to get people to try the DSv2 API and build DSv2 data
>> sources, can we recommend the 3.0-preview release for this? That would get
>> people shifting to 3.0 faster, which is probably better overall compared to
>> maintaining two major versions. There’s not that much else changing in 3.0
>> if you already want to update your Java version.
>>
>> On Sep 21, 2019, at 2:45 PM, Ryan Blue  wrote:
>>
>> > If you insist we shouldn't change the unstable temporary API in 3.x . . .
>>
>> Not what I'm saying at all. I said we should carefully consider whether a
>> breaking change is the right decision in the 3.x line.
>>
>> All I'm suggesting is that we can make a 2.5 release with the feature and
>> an API that is the same as the one in 3.0.
>>
>> > I also don't get this backporting a giant feature to 2.x line
>>
>> I am planning to do this so we can use DSv2 before 3.0 is released. Then
>> we can have a source implementation that works in both 2.x and 3.0 to make
>> the transition easier. Since I'm already doing the work, I'm offering to
>> share it with the community.
>>
>>
>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin  wrote:
>>
>>> Because for example we'd need to move the location of InternalRow,
>>> breaking the package name. If you insist we shouldn't change the unstable
>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>> different from my understanding of the situation when you exposed it, then
>>> I'd say we should gate 3.0 on having a stable row interface.
>>>
>>> I also don't get this backporting a giant feature to 2.x line ... as
>>> suggested by others in the thread, DSv2 would be one of the main reasons
>>> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
>>> Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>
>>>
>>>
>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue  wrote:
>>>
 Why would that require an incompatible change?

 We *could* make an incompatible change and remove support for
 InternalRow, but I think we would want to carefully consider whether that
 is the right decision. And in any case, we would be able to keep 2.5 and
 3.0 compatible, which is the main goal.

 On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin 
 wrote:

 How would you not make incompatible changes in 3.x? As discussed the
 InternalRow API is not stable and needs to change.

 On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue  wrote:

 > Making downstream projects diverge their implementations heavily between
 minor versions (say, 2.4 vs 2.5) wouldn't be a good experience

 You're right that the API has been evolving in the 2.x line. But, it is
 now reasonably stable with respect to the current feature set and we should
 not need to break compatibility in the 3.x line. Because we have reached
 our goals for the 3.0 release, we can backport at least those features to
 2.x and confidently have an API that works in a 2.x release and is
 compatible with 3.0, if not 3.1 and later releases as well.

 > I'd rather say preparation of Spark 2.5 should be started after Spark
 3.0 is officially released

 The reason I'm suggesting this is that I'm already going to do the work
 to backport the 3.0 release features to 2.4. I've been asked by several
 people when DSv2 will be released, so I know there is a lot of interest in
 making this available sooner than 3.0. If I'm already doing the work, then
 I'd be happy to share that with the community.

 I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
 while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
 about complete so we can easily release the same set of features and API in
 2.5 and 3.0.

 If we decide for some reason to wait until after 3.0 is released, I
 don't know that there is much value in a 2.5. The purpose is to be a step
 toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me.
 It also wouldn't get these features out any sooner than 3.0, as a 2.5
 release probably would, given the work needed to validate the incompatible
 changes in 3.0.

 > The DSv2 change would be a major backward incompatibility which may make
 Spark 2.x users hesitate to upgrade

 As I pointed out, DSv2 has been changing in the 2.x line, so this is
 expected. I don't think it will need incompatible changes in the 3.x line.

 On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim  wrote:

 Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
 deal with this as the change made confusion on my PRs...), but 

Re: Documentation on org.apache.spark.sql.functions backend.

2019-09-16 Thread Marco Gaido
Hi Vipul,

I am afraid I cannot help you on that.

Thanks,
Marco

On Mon, Sep 16, 2019 at 10:44 Vipul Rajan
wrote:

> Hi Marco,
>
> That does help. Thanks for taking the time. I am confused as to how that
> Expression is created. There are methods like eval, nullSafeEval, and
> doGenCode. Aren't there any architectural docs that could help with what
> exactly is happening? Reverse engineering seems a bit daunting.
>
> Regards
>
> On Mon, Sep 16, 2019 at 1:36 PM Marco Gaido 
> wrote:
>
>> Hi Vipul,
>>
>> a function is never turned into a logical plan. A function is turned into
>> an Expression. And an Expression can be part of many Logical or Physical
>> Plans.
>> Hope this helps.
>>
>> Thanks,
>> Marco
>>
>> On Mon, Sep 16, 2019 at 08:27 Vipul Rajan <
>> vipul.s.p...@gmail.com> wrote:
>>
>>> I am trying to create a function that reads data from Kafka,
>>> communicates with the Confluent Schema Registry, and decodes Avro data with
>>> evolving schemas. I am trying not to create hack-ish patches but to write
>>> proper code that I could maybe even create pull requests for. Looking at
>>> the code, I have been able to figure out a few things regarding how
>>> expressions are generated and how they help to accomplish what a function
>>> does, but there is still a ton I just cannot wrap my head around.
>>>
>>> I am unable to find any documentation that gets into such nitty-gritty
>>> details of Spark. *I am writing in the hope of finding some help. Do you
>>> have any documentation that explains how a function
>>> (org.apache.spark.sql.functions._) is turned into a logical plan?*
>>>
>>


Re: Documentation on org.apache.spark.sql.functions backend.

2019-09-16 Thread Marco Gaido
Hi Vipul,

a function is never turned into a logical plan. A function is turned into an
Expression. And an Expression can be part of many Logical or Physical Plans.
Hope this helps.
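
For concreteness, a minimal hedged sketch (spark-shell; class names and printed
output here are illustrative, not part of the original mail) of what this means:

import org.apache.spark.sql.functions.{col, upper}

// A functions._ call only builds a Column wrapping a Catalyst Expression.
val c = upper(col("name"))
println(c.expr)           // the Expression tree, e.g. upper('name)
println(c.expr.getClass)  // org.apache.spark.sql.catalyst.expressions.Upper

// The Expression becomes part of a plan only when the Column is used in a
// query: df.select(c) places it inside a Project node of df's logical plan,
// visible via df.select(c).queryExecution.analyzed.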

Thanks,
Marco

On Mon, Sep 16, 2019 at 08:27 Vipul Rajan
wrote:

> I am trying to create a function that reads data from Kafka, communicates
> with the Confluent Schema Registry, and decodes Avro data with evolving
> schemas. I am trying not to create hack-ish patches but to write proper code
> that I could maybe even create pull requests for. Looking at the code I have
> been able to figure out a few things regarding how expressions are generated
> and how they help to accomplish what a function does, but there is still a
> ton I just cannot wrap my head around.
>
> I am unable to find any documentation that gets into such nitty-gritty
> details of Spark. *I am writing in the hope of finding some help. Do you have
> any documentation that explains how a function
> (org.apache.spark.sql.functions._) is turned into a logical plan?*
>


Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-28 Thread Marco Gaido
+1

On Wed, Aug 28, 2019 at 06:31 Wenchen Fan
wrote:

> +1
>
> On Wed, Aug 28, 2019 at 2:43 AM DB Tsai  wrote:
>
>> +1
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>> On Tue, Aug 27, 2019 at 11:31 AM Dongjoon Hyun 
>> wrote:
>> >
>> > +1.
>> >
>> > I also verified SHA/GPG and tested UTs on AdoptOpenJDKu8_222/CentOS6.9
>> with profile
>> > "-Pyarn -Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive
>> -Phive-thriftserver"
>> >
>> > Additionally, the JDBC integration tests were also run.
>> >
>> > Thank you, Kazuaki!
>> >
>> > Bests,
>> > Dongjoon.
>> >
>> >
>> > On Tue, Aug 27, 2019 at 11:20 AM Sean Owen  wrote:
>> >>
>> >> +1 - license and signature looks OK, the docs look OK, the artifacts
>> >> seem to be in order. Tests passed for me when building from source
>> >> with most common profiles set.
>> >>
>> >> On Mon, Aug 26, 2019 at 3:28 PM Kazuaki Ishizaki 
>> wrote:
>> >> >
>> >> > Please vote on releasing the following candidate as Apache Spark
>> version 2.3.4.
>> >> >
>> >> > The vote is open until August 29th 2PM PST and passes if a majority of
>> +1 PMC votes are cast, with
>> >> > a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Release this package as Apache Spark 2.3.4
>> >> > [ ] -1 Do not release this package because ...
>> >> >
>> >> > To learn more about Apache Spark, please see
>> https://spark.apache.org/
>> >> >
>> >> > The tag to be voted on is v2.3.4-rc1 (commit
>> 8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210):
>> >> > https://github.com/apache/spark/tree/v2.3.4-rc1
>> >> >
>> >> > The release files, including signatures, digests, etc. can be found
>> at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-bin/
>> >> >
>> >> > Signatures used for Spark RCs can be found in this file:
>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >> >
>> >> > The staging repository for this release can be found at:
>> >> >
>> https://repository.apache.org/content/repositories/orgapachespark-1331/
>> >> >
>> >> > The documentation corresponding to this release can be found at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-docs/
>> >> >
>> >> > The list of bug fixes going into 2.3.4 can be found at the following
>> URL:
>> >> > https://issues.apache.org/jira/projects/SPARK/versions/12344844
>> >> >
>> >> > FAQ
>> >> >
>> >> > =
>> >> > How can I help test this release?
>> >> > =
>> >> >
>> >> > If you are a Spark user, you can help us test this release by taking
>> >> > an existing Spark workload and running on this release candidate,
>> then
>> >> > reporting any regressions.
>> >> >
>> >> > If you're working in PySpark you can set up a virtual env and install
>> >> > the current RC and see if anything important breaks; in Java/Scala
>> >> > you can add the staging repository to your project's resolvers and test
>> >> > with the RC (make sure to clean up the artifact cache before/after so
>> >> > you don't end up building with an out-of-date RC going forward).
>> >> >
>> >> > ===
>> >> > What should happen to JIRA tickets still targeting 2.3.4?
>> >> > ===
>> >> >
>> >> > The current list of open tickets targeted at 2.3.4 can be found at:
>> >> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> >> > Version/s" = 2.3.4
>> >> >
>> >> > Committers should look at those and triage. Extremely important bug
>> >> > fixes, documentation, and API tweaks that impact compatibility should
>> >> > be worked on immediately. Everything else please retarget to an
>> >> > appropriate release.
>> >> >
>> >> > ==
>> >> > But my bug isn't fixed?
>> >> > ==
>> >> >
>> >> > In order to make timely releases, we will typically not hold the
>> >> > release unless the bug in question is a regression from the previous
>> >> > release. That being said, if there is something which is a regression
>> >> > that has not been correctly targeted please ping me or a committer to
>> >> > help target the issue.
>> >> >
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Unmarking most things as experimental, evolving for 3.0?

2019-08-22 Thread Marco Gaido
Thanks for bringing this out Sean.
+1 from me as well!

Thanks,
Marco

On Thu, Aug 22, 2019 at 08:21 Dongjoon Hyun <
dongjoon.h...@gmail.com> wrote:

> +1 for unmarking old ones (made in `2.3.x` and before).
> Thank you, Sean.
>
> Bests,
> Dongjoon.
>
> On Wed, Aug 21, 2019 at 6:46 PM Sean Owen  wrote:
>
>> There are currently about 130 things marked as 'experimental' in
>> Spark, and some have been around since Spark 1.x. A few may be
>> legitimately still experimental (e.g. barrier mode), but, would it be
>> safe to say most of these annotations should be removed for 3.0?
>>
>> What's the theory for evolving vs experimental -- would almost all of
>> these items from, say, 2.3 and before be considered stable now, de
>> facto? Meaning, if we wouldn't take a breaking change for them after
>> 3.0, seems like they're stable.
>>
>> I can open a PR that removes most of it and see if anything looks
>> wrong, if that's an easy way forward.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Opinions wanted: how much to match PostgreSQL semantics?

2019-07-08 Thread Marco Gaido
Hi Sean,

Thanks for bringing this up. Honestly, my opinion is that Spark should be
fully ANSI SQL compliant. Where ANSI SQL compliance is not an issue, I am
fine following any other DB. IMHO, we won't get 100% compliance with
any DB anyway - postgres in this case (e.g. for decimal operations, we are
following SQLServer, and postgres behaviour would be very hard to meet) -
so I think it is fine that PMC members decide for each feature whether it
is worth supporting or not.

Thanks,
Marco

On Mon, 8 Jul 2019, 20:09 Sean Owen,  wrote:

> See the particular issue / question at
> https://github.com/apache/spark/pull/24872#issuecomment-509108532 and
> the larger umbrella at
> https://issues.apache.org/jira/browse/SPARK-27764 -- Dongjoon rightly
> suggests this is a broader question.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Marco Gaido
Hi Dongjoon,
Thanks for the proposal! I like the idea. Maybe we can extend it to the
component too, and to some JIRA labels such as correctness, which may be
worth highlighting in PRs as well. My only concern is that in many cases
JIRAs are created not very carefully, so they may be incorrect at the moment
of PR creation and get updated later: keeping them in sync may be an extra
effort.

On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:

> Seems like a good idea. Can we test this with a component first?
>
> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>> contributions, we have lots of JIRAs and PRs consequently. One specific
>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>
>> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
>> There are two main benefits:
>> 1. It helps the communication between the contributors and reviewers with
>> more information.
>> (In some cases, some people only visit GitHub to see the PR and
>> commits)
>> 2. `Labels` is searchable. We don't need to visit Apache Jira to search
>> PRs to see a specific type.
>> (For example, the reviewers can see and review 'BUG' PRs first by
>> using `is:open is:pr label:BUG`.)
>>
>> Of course, this can be done automatically without human intervention.
>> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
>> can add the labels from the beginning. If needed, I can volunteer to update
>> the script.
>>
>> To show the demo, I labeled several PRs manually. You can see the result
>> right now in Apache Spark PR page.
>>
>>   - https://github.com/apache/spark/pulls
>>
>> If you're surprised by those manual activities, I want to apologize
>> for that. I hope we can take advantage of the existing GitHub features to
>> serve the Apache Spark community in a way better than yesterday.
>>
>> What do you think about this specific suggestion?
>>
>> Bests,
>> Dongjoon
>>
>> PS. I saw that `Request Review` and `Assign` features are already used
>> for some purposes, but these features are out of the scope of this email.
>>
>


Re: [Spark SQL]: looking for place operators apply on the dataset / dataframe

2019-03-28 Thread Marco Gaido
Hi,

you can check your execution plan and find from there which *Exec
classes are used. Please note that in the case of whole-stage codegen, the
child operators are executed inside the WholeStageCodegenExec.
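
To make this concrete, a hedged sketch (spark-shell; the plan text below is
illustrative and varies across Spark versions):

// explain() prints the physical plan, whose nodes are the *Exec operators.
val df = spark.range(100).toDF("id")
df.filter("id > 10").groupBy("id").count().explain()
// Typically something like:
// *(2) HashAggregate(keys=[id#..], functions=[count(1)])   <- HashAggregateExec
// +- Exchange hashpartitioning(id#.., 200)                 <- ShuffleExchangeExec
//    +- *(1) HashAggregate(keys=[id#..], functions=[partial_count(1)])
//       +- *(1) Filter (id#.. > 10)                        <- FilterExec
//          +- *(1) Range (0, 100, step=1, splits=..)       <- RangeExec
// The "*(n)" prefix marks operators compiled together inside a WholeStageCodegenExec.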

Bests,
Marco

On Thu, Mar 28, 2019 at 15:21 ehsan shams <
ehsan.shams.r...@gmail.com> wrote:

> Hi
>
> I would like to know where exactly (which class/function) Spark SQL will
> apply the operators on Dataset/DataFrame rows.
> For example, for the following filter or groupBy, which class is
> responsible, and which one will iterate over the rows to do its operation?
>
> Kind regards
> Ehsan Shams
>
> val df1 = sqlContext.read.format("csv").option("header", 
> "true").load("src/main/resources/Names.csv")
> val df11 = df1.filter("County='Hessen'")
> val df12 = df1.groupBy("County")
>
>


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Marco Gaido
Thanks for this SPIP.
I cannot comment on the docs, but just wanted to highlight one thing. On
page 5 of the SPIP, when we talk about DRA, I see:

"For instance, if each executor consists of 4 CPUs and 2 GPUs, and each task
requires 1 CPU and 1 GPU, then we shall throw an error on application start
because we shall always have at least 2 idle CPUs per executor"

I am not sure this is the correct behavior. We might have CPU-only tasks
running in parallel as well, hence that setup may make sense. I'd rather
emit a WARN or something similar. Anyway, we just said we will keep GPU
scheduling at the task level out of scope for the moment, right?
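
As an aside, a hedged sketch of that arithmetic using the resource configs that
eventually shipped for this SPIP in Spark 3.0 (the property names are an
assumption relative to this thread, and a real setup would also need a GPU
discovery script):

import org.apache.spark.sql.SparkSession

// Executor: 4 CPUs, 2 GPUs; task: 1 CPU, 1 GPU  =>  min(4/1, 2/1) = 2 concurrent
// tasks per executor, so 2 CPUs stay idle -- the case the SPIP text would reject
// and this mail suggests only warning about.
val spark = SparkSession.builder()
  .appName("gpu-scheduling-sketch")
  .config("spark.executor.cores", "4")
  .config("spark.task.cpus", "1")
  .config("spark.executor.resource.gpu.amount", "2")
  .config("spark.task.resource.gpu.amount", "1")
  .getOrCreate()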

Thanks,
Marco

On Thu, Mar 21, 2019 at 01:26 Xiangrui Meng
wrote:

> Steve, the initial work would focus on GPUs, but we will keep the
> interfaces general to support other accelerators in the future. This was
> mentioned in the SPIP and draft design.
>
> Imran, you should have comment permission now. Thanks for making a pass! I
> don't think the proposed 3.0 features should block Spark 3.0 release
> either. It is just an estimate of what we could deliver. I will update the
> doc to make it clear.
>
> Felix, it would be great if you can review the updated docs and let us
> know your feedback.
>
> How about setting a tentative vote closing time to next Tue (Mar 26)?
>
> On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid 
> wrote:
>
>> Thanks for sending the updated docs.  Can you please give everyone the
>> ability to comment?  I have some comments, but overall I think this is a
>> good proposal and addresses my prior concerns.
>>
>> My only real concern is that I notice some mention of "must dos" for
>> spark 3.0.  I don't want to make any commitment to holding spark 3.0 for
>> parts of this, I think that is an entirely separate decision.  However I'm
>> guessing this is just a minor wording issue, and you really mean that's a
>> minimal set of features you are aiming for, which is reasonable.
>>
>> On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang 
>> wrote:
>>
>>> Hi all,
>>>
>>> I updated the SPIP doc and stories; I hope they now contain a clear scope
>>> of the changes and enough details for the SPIP vote.
>>> Please review the updated docs, thanks!
>>>
>>> Xiangrui Meng wrote on Wed, Mar 6, 2019 at 8:35 AM:
>>>
 How about letting Xingbo make a major revision to the SPIP doc to make
 it clear what proposed are? I like Felix's suggestion to switch to the new
 Heilmeier template, which helps clarify what are proposed and what are not.
 Then let's review the new SPIP and resume the vote.

 On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid 
 wrote:

> OK, I suppose then we are getting bogged down into what a vote on an
> SPIP means then anyway, which I guess we can set aside for now.  With the
> level of detail in this proposal, I feel like there is a reasonable chance
> I'd still -1 the design or implementation.
>
> And the other thing you're implicitly asking the community for is to
> prioritize this feature for continued review and maintenance.  There is
> already work to be done in things like making barrier mode support dynamic
> allocation (SPARK-24942), bugs in failure handling (eg. SPARK-25250), and
> general efficiency of failure handling (eg. SPARK-25341, SPARK-20178).  
> I'm
> very concerned about getting spread too thin.
>

> But if this is really just a vote on (1) is better gpu support
> important for spark, in some form, in some release? and (2) is it
> *possible* to do this in a safe way?  then I will vote +0.
>
> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves 
> wrote:
>
>> So to me most of the questions here are implementation/design
>> questions, I've had this issue in the past with SPIP's where I expected 
>> to
>> have more high level design details but was basically told that belongs 
>> in
>> the design jira follow on. This makes me think we need to revisit what a
>> SPIP really need to contain, which should be done in a separate thread.
>> Note personally I would be for having more high level details in it.
>> But the way I read our documentation on a SPIP right now that detail
>> is all optional, now maybe we could argue its based on what reviewers
>> request, but really perhaps we should make the wording of that more
>> required.  thoughts?  We should probably separate that discussion if 
>> people
>> want to talk about that.
>>
>> For this SPIP in particular the reason I +1 it is because it came
>> down to 2 questions:
>>
>> 1) do I think spark should support this -> my answer is yes, I think
>> this would improve spark, users have been requesting both better GPUs

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-02 Thread Marco Gaido
+1, a critical feature for AI/DL!

On Sat, Mar 2, 2019 at 05:14 Weichen Xu <
weichen...@databricks.com> wrote:

> +1, nice feature!
>
> On Sat, Mar 2, 2019 at 6:11 AM Yinan Li  wrote:
>
>> +1
>>
>> On Fri, Mar 1, 2019 at 12:37 PM Tom Graves 
>> wrote:
>>
>>> +1 for the SPIP.
>>>
>>> Tom
>>>
>>> On Friday, March 1, 2019, 8:14:43 AM CST, Xingbo Jiang <
>>> jiangxb1...@gmail.com> wrote:
>>>
>>>
>>> Hi all,
>>>
>>> I want to call for a vote of SPARK-24615. It improves Spark
>>> by making it aware of GPUs exposed by cluster managers, and hence Spark can
>>> match GPU resources with user task requests properly. The proposal and
>>> production doc were made available on dev@ to collect input. You can also
>>> find a design sketch at SPARK-27005.
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thank you!
>>>
>>> Xingbo
>>>
>>


Re: SparkThriftServer Authorization design

2019-02-16 Thread Marco Gaido
Is this a feature request or a proposal? If it is the latter, may you
please provide a design doc, so the community can look at it?

Otherwise I think one of the main issues with authorization in STS is that
all the queries are actually run inside the same spark job and hence with
the same user. There are other projects trying to address those
limitations. One of them, for instance, is Livy, where a Thrift server has
been recently introduced in order to overcome some of STS's limitations. So
you might probably want to look at it.

Thanks,
Marco

On Sat, Feb 16, 2019 at 01:19 t4
wrote:

> Goal is to provide ability to deny/allow access to given tables on a per
> user
> basis (this is the user connecting via jdbc to spark thrift server. ie with
> LDAP creds). ie user bob can see table A but not table B. user mary can see
> table B but not table A.
>
> What are folks thoughts on the approach?
> 1. SqlStdAuth like Hive has already (any reason Spark does not have this
> yet?)
> 2. Apache Ranger
> 3. Cloudera Sentry
> 4. json spec file like 'File Based Authorization' section in
>
> https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/connector/hive-security.rst
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: I want to contribute to Apache Spark.

2019-02-13 Thread Marco Gaido
Hi,

You need no permissions to start contributing to Spark. Just start working
on the JIRAs you want and submit a PR for them. You will be added to the
contributors in JIRA once your PR gets merged and you are assigned the
related JIRA. For more information, please refer to the contributing page
on the website.

Thanks,
Looking forward to seeing your PRs.
Marco

On Thu, 14 Feb 2019, 06:32 wangfei wrote:
> Hi Guys,
>
> I want to contribute to Apache Spark.
> Would you please give me the permission as a contributor?
> My JIRA ID is feiwang.
> hzfeiwang
> hzfeiw...@163.com
>
> 
> (Signature customized by NetEase Mail Master)
>
>


Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-06 Thread Marco Gaido
+1 from me as well.

On Wed, Feb 6, 2019 at 16:58 Yanbo Liang
wrote:

> +1 for the proposal
>
>
>
> On Thu, Jan 31, 2019 at 12:46 PM Mingjie Tang  wrote:
>
>> +1, this is a very very important feature.
>>
>> Mingjie
>>
>> On Thu, Jan 31, 2019 at 12:42 AM Xiao Li  wrote:
>>
>>> Change my vote from +1 to ++1
>>>
>>> Xiangrui Meng wrote on Wed, Jan 30, 2019 at 6:20 AM:
>>>
 Correction: +0 vote doesn't mean "Don't really care". Thanks Ryan for
 the offline reminder! Below is the Apache official interpretation
 of fraction values:

 The in-between values are indicative of how strongly the voting
 individual feels. Here are some examples of fractional votes and ways in
 which they might be intended and interpreted:
 +0: 'I don't feel strongly about it, but I'm okay with this.'
 -0: 'I won't get in the way, but I'd rather we didn't do this.'
 -0.5: 'I don't like this idea, but I can't find any rational
 justification for my feelings.'
 ++1: 'Wow! I like this! Let's do it!'
 -0.9: 'I really don't like this, but I'm not going to stand in the way
 if everyone else wants to go ahead with it.'
 +0.9: 'This is a cool idea and i like it, but I don't have time/the
 skills necessary to help out.'


 On Wed, Jan 30, 2019 at 12:31 AM Martin Junghanns
  wrote:

> Hi Dongjoon,
>
> Thanks for the hint! I updated the SPIP accordingly.
>
> I also changed the access permissions for the SPIP and design sketch
> docs so that anyone can comment.
>
> Best,
>
> Martin
> On 29.01.19 18:59, Dongjoon Hyun wrote:
>
> Hi, Xiangrui Meng.
>
> +1 for the proposal.
>
> However, please update the following section for this vote. As we see,
> it seems to be inaccurate because today is Jan. 29th. (Almost February).
> (Since I cannot comment on the SPIP, I replied here.)
>
> Q7. How long will it take?
>
>-
>
>If accepted by the community by the end of December 2018, we
>predict to be feature complete by mid-end March, allowing for QA during
>April 2019, making the SPIP part of the next major Spark release (3.0, 
> ETA
>May, 2019).
>
> Bests,
> Dongjoon.
>
> On Tue, Jan 29, 2019 at 8:52 AM Xiao Li  wrote:
>
>> +1
>>
>> Jules Damji wrote on Tue, Jan 29, 2019 at 8:14 AM:
>>
>>> +1 (non-binding)
>>> (Heard their proposed tech-talk at Spark + A.I summit in London.
>>> Well attended & well received.)
>>>
>>> —
>>> Sent from my iPhone
>>> Pardon the dumb thumb typos :)
>>>
>>> On Jan 29, 2019, at 7:30 AM, Denny Lee 
>>> wrote:
>>>
>>> +1
>>>
>>> yay - let's do it!
>>>
>>> On Tue, Jan 29, 2019 at 6:28 AM Xiangrui Meng 
>>> wrote:
>>>
 Hi all,

 I want to call for a vote of SPARK-25994. It introduces
 a new DataFrame-based component to Spark, which supports property graph
 construction, Cypher queries, and graph algorithms. The proposal was made
 available on user@ and dev@ to collect input. You can also find a sketch
 design doc attached to SPARK-26028.

 The vote will be up for the next 72 hours. Please reply with your
 vote:

 +1: Yeah, let's go forward and implement the SPIP.
 +0: Don't really care.
 -1: I don't think this is a good idea because of the following
 technical reasons.

 Best,
 Xiangrui

>>>


Re: Self join

2019-01-30 Thread Marco Gaido
Hi all,

this thread got a bit stuck. Hence, if there are no objections, I'd go
ahead with a design doc describing the solution/workaround I mentioned
before. Any concerns?
Thanks,
Marco

On Thu, Dec 13, 2018 at 18:15 Ryan Blue
wrote:

> Thanks for the extra context, Marco. I thought you were trying to propose
> a solution.
>
> On Thu, Dec 13, 2018 at 2:45 AM Marco Gaido 
> wrote:
>
>> Hi Ryan,
>>
>> My goal with this email thread is to discuss with the community if there
>> are better ideas (as I was told many other people tried to address this).
>> I'd consider this as a brainstorming email thread. Once we have a good
>> proposal, then we can go ahead with a SPIP.
>>
>> Thanks,
>> Marco
>>
>> On Wed, Dec 12, 2018 at 19:13 Ryan Blue
>> wrote:
>>
>>> Marco,
>>>
>>> I'm actually asking for a design doc that clearly states the problem and
>>> proposes a solution. This is a substantial change and probably should be an
>>> SPIP.
>>>
>>> I think that would be more likely to generate discussion than referring
>>> to PRs or a quick paragraph on the dev list, because the only people that
>>> are looking at it now are the ones already familiar with the problem.
>>>
>>> rb
>>>
>>> On Wed, Dec 12, 2018 at 2:05 AM Marco Gaido 
>>> wrote:
>>>
>>>> Thank you all for your answers.
>>>>
>>>> @Ryan Blue  sure, let me state the problem more
>>>> clearly: imagine you have 2 dataframes with a common lineage (for instance
>>>> one is derived from the other by some filtering or anything you prefer).
>>>> And imagine you want to join these 2 dataframes. Currently, there is a fix
>>>> by Reynold which deduplicates the join condition in case the condition is
>>>> an equality one (please notice that in this case, it doesn't matter which
>>>> one is on the left and which one on the right). But if the condition
>>>> involves other comparisons, such as a ">" or a "<", this would result in an
>>>> analysis error, because the attributes on both sides are the same (eg. you
>>>> have the same id#3 attribute on both sides), and you cannot deduplicate
>>>> them blindly as which one is on a specific side matters.
>>>>
>>>> @Reynold Xin  my proposal was to add a dataset id
>>>> in the metadata of each attribute, so that in this case we can distinguish
>>>> from which dataframe the attribute is coming from, ie. having the
>>>> DataFrames `df1` and `df2` where `df2` is derived from `df1`,
>>>> `df1.join(df2, df1("a") > df2("a"))` could be resolved because we would
>>>> know that the first attribute is taken from `df1` and so it has to be
>>>> resolved using it and the same for the other. But I am open to any approach
>>>> to this problem, if other people have better ideas/suggestions.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> On Tue, Dec 11, 2018 at 18:31 Jörn Franke <
>>>> jornfra...@gmail.com> wrote:
>>>>
>>>>> I don’t know your exact underlying business problem,  but maybe a
>>>>> graph solution, such as Spark Graphx meets better your requirements.
>>>>> Usually self-joins are done to address some kind of graph problem (even if
>>>>> you would not describe it as such) and is for these kind of problems much
>>>>> more efficient.
>>>>>
>>>>> On 11.12.2018 at 12:44, Marco Gaido wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to bring to the attention of a more people a problem which
>>>>> has been there for long, ie, self joins. Currently, we have many troubles
>>>>> with them. This has been reported several times to the community and seems
>>>>> to affect many people, but as of now no solution has been accepted for it.
>>>>>
>>>>> I created a PR some time ago in order to address the problem (
>>>>> https://github.com/apache/spark/pull/21449), but Wenchen mentioned he
>>>>> tried to fix this problem too but so far no attempt was successful because
>>>>> there is no clear semantic (
>>>>> https://github.com/apache/spark/pull/21449#issuecomment-393554552).
>>>>>
>>>>> So I'd like to propose to discuss here which is the best approach for
>>>>> tackling this issue, which I think would be great to fix for 3.0.0, so if
>>>>> we decide to introduce breaking changes in the design, we can do that.
>>>>>
>>>>> Thoughts on this?
>>>>>
>>>>> Thanks,
>>>>> Marco
>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Welcome Jose Torres as a Spark committer

2019-01-29 Thread Marco Gaido
Congrats, Jose!

Bests,
Marco

On Wed, Jan 30, 2019 at 03:17 JackyLee
wrote:

> Congrats, Joe!
>
> Best,
> Jacky
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-09 Thread Marco Gaido
Jörn, may you explain your proposal a bit more, please? We are not
modifying the existing decimal datatype: this is how it works now. If you
check the PR, the only difference is how we compute the result for the
division operation. The discussion about precision and scale is about: shall
we limit them more than we are doing now? Now we are supporting any scale
<= precision and any precision in the range [1, 38].

On Wed, Jan 9, 2019 at 09:13 Jörn Franke
wrote:

> Maybe it is better to introduce a new datatype that supports negative
> scale, otherwise the migration and testing efforts for organizations
> running Spark application becomes too large. Of course the current decimal
> will be kept as it is.
>
> On 07.01.2019 at 15:08, Marco Gaido wrote:
>
> In general we can say that some datasources allow them, others fail. At
> the moment, we are doing no casting before writing (so we can state so in
> the doc). But since there is ongoing discussion for DSv2, we can maybe add
> a flag/interface there for "negative scale intolerant" DS and try and cast
> before writing to them. What do you think about this?
>
> On Mon, Jan 7, 2019 at 15:03 Wenchen Fan
> wrote:
>
>> AFAIK parquet spec says decimal scale can't be negative. If we want to
>> officially support negative-scale decimal, we should clearly define the
>> behavior when writing negative-scale decimals to parquet and other data
>> sources. The most straightforward way is to fail for this case, but maybe
>> we can do something better, like casting decimal(1, -20) to decimal(20, 0)
>> before writing.
>>
>> On Mon, Jan 7, 2019 at 9:32 PM Marco Gaido 
>> wrote:
>>
>>> Hi Wenchen,
>>>
>>> thanks for your email. I agree adding doc for decimal type, but I am not
>>> sure what you mean speaking of the behavior when writing: we are not
>>> performing any automatic casting before writing; if we want to do that, we
>>> need a design about it I think.
>>>
>>> I am not sure if it makes sense to set a min for it. That would break
>>> backward compatibility (for very weird use case), so I wouldn't do that.
>>>
>>> Thanks,
>>> Marco
>>>
>>> On Mon, Jan 7, 2019 at 05:53 Wenchen Fan
>>> wrote:
>>>
>>>> I think we need to do this for backward compatibility, and according to
>>>> the discussion in the doc, SQL standard allows negative scale.
>>>>
>>>> To do this, I think the PR should also include a doc for the decimal
>>>> type, like the definition of precision and scale(this one
>>>> <https://stackoverflow.com/questions/35435691/bigdecimal-precision-and-scale>
>>>> looks pretty good), and the result type of decimal operations, and the
>>>> behavior when writing out decimals(e.g. we can cast decimal(1, -20) to
>>>> decimal(20, 0) before writing).
>>>>
>>>> Another question is, shall we set a min scale? e.g. shall we allow
>>>> decimal(1, -1000)?
>>>>
>>>> On Thu, Oct 25, 2018 at 9:49 PM Marco Gaido 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> a bit more than one month ago, I sent a proposal for handling properly
>>>>> decimals with negative scales in our operations. This is a long standing
>>>>> problem in our codebase as we derived our rules from Hive and SQLServer
>>>>> where negative scales are forbidden, while in Spark they are not.
>>>>>
>>>>> The discussion has been stale for a while now. No more comments on the
>>>>> design doc:
>>>>> https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit#heading=h.x7062zmkubwm
>>>>> .
>>>>>
>>>>> So I am writing this e-mail in order to check whether there are more
>>>>> comments on it or we can go ahead with the PR.
>>>>>
>>>>> Thanks,
>>>>> Marco
>>>>>
>>>>


Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-09 Thread Marco Gaido
Oracle does the same: "The *scale* must be less than or equal to the
precision." (see
https://docs.oracle.com/javadb/10.6.2.1/ref/rrefsqlj15260.html).

On Wed, Jan 9, 2019 at 05:31 Wenchen Fan
wrote:

> Some more thoughts. If we support unlimited negative scale, why can't we
> support unlimited positive scale? e.g. 0.0001 can be decimal(1, 4) instead
> of (4, 4). I think we need more references here: how other databases deal
> with decimal type and parse decimal literals?
>
> On Mon, Jan 7, 2019 at 10:36 PM Wenchen Fan  wrote:
>
>> I'm OK with it, i.e. fail the write if there are negative-scale decimals
>> (we need to document it though). We can improve it later in data source v2.
>>
>> On Mon, Jan 7, 2019 at 10:09 PM Marco Gaido 
>> wrote:
>>
>>> In general we can say that some datasources allow them, others fail. At
>>> the moment, we are doing no casting before writing (so we can state so in
>>> the doc). But since there is ongoing discussion for DSv2, we can maybe add
>>> a flag/interface there for "negative scale intolerant" DS and try and cast
>>> before writing to them. What do you think about this?
>>>
>>> On Mon, Jan 7, 2019 at 15:03 Wenchen Fan
>>> wrote:
>>>
>>>> AFAIK parquet spec says decimal scale can't be negative. If we want to
>>>> officially support negative-scale decimal, we should clearly define the
>>>> behavior when writing negative-scale decimals to parquet and other data
>>>> sources. The most straightforward way is to fail for this case, but maybe
>>>> we can do something better, like casting decimal(1, -20) to decimal(20, 0)
>>>> before writing.
>>>>
>>>> On Mon, Jan 7, 2019 at 9:32 PM Marco Gaido 
>>>> wrote:
>>>>
>>>>> Hi Wenchen,
>>>>>
>>>>> thanks for your email. I agree adding doc for decimal type, but I am
>>>>> not sure what you mean speaking of the behavior when writing: we are not
>>>>> performing any automatic casting before writing; if we want to do that, we
>>>>> need a design about it I think.
>>>>>
>>>>> I am not sure if it makes sense to set a min for it. That would break
>>>>> backward compatibility (for very weird use case), so I wouldn't do that.
>>>>>
>>>>> Thanks,
>>>>> Marco
>>>>>
>>>>> On Mon, Jan 7, 2019 at 05:53 Wenchen Fan <
>>>>> cloud0...@gmail.com> wrote:
>>>>>
>>>>>> I think we need to do this for backward compatibility, and according
>>>>>> to the discussion in the doc, SQL standard allows negative scale.
>>>>>>
>>>>>> To do this, I think the PR should also include a doc for the decimal
>>>>>> type, like the definition of precision and scale(this one
>>>>>> <https://stackoverflow.com/questions/35435691/bigdecimal-precision-and-scale>
>>>>>> looks pretty good), and the result type of decimal operations, and the
>>>>>> behavior when writing out decimals(e.g. we can cast decimal(1, -20) to
>>>>>> decimal(20, 0) before writing).
>>>>>>
>>>>>> Another question is, shall we set a min scale? e.g. shall we allow
>>>>>> decimal(1, -1000)?
>>>>>>
>>>>>> On Thu, Oct 25, 2018 at 9:49 PM Marco Gaido 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> a bit more than one month ago, I sent a proposal for handling
>>>>>>> properly decimals with negative scales in our operations. This is a long
>>>>>>> standing problem in our codebase as we derived our rules from Hive and
>>>>>>> SQLServer where negative scales are forbidden, while in Spark they are 
>>>>>>> not.
>>>>>>>
>>>>>>> The discussion has been stale for a while now. No more comments on
>>>>>>> the design doc:
>>>>>>> https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit#heading=h.x7062zmkubwm
>>>>>>> .
>>>>>>>
>>>>>>> So I am writing this e-mail in order to check whether there are more
>>>>>>> comments on it or we can go ahead with the PR.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Marco
>>>>>>>
>>>>>>


Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-07 Thread Marco Gaido
In general we can say that some datasources allow them, others fail. At the
moment, we are doing no casting before writing (so we can state so in the
doc). But since there is ongoing discussion for DSv2, we can maybe add a
flag/interface there for "negative scale intolerant" DS and try and cast
before writing to them. What do you think about this?

On Mon, Jan 7, 2019 at 15:03 Wenchen Fan
wrote:

> AFAIK parquet spec says decimal scale can't be negative. If we want to
> officially support negative-scale decimal, we should clearly define the
> behavior when writing negative-scale decimals to parquet and other data
> sources. The most straightforward way is to fail for this case, but maybe
> we can do something better, like casting decimal(1, -20) to decimal(20, 0)
> before writing.
>
> On Mon, Jan 7, 2019 at 9:32 PM Marco Gaido  wrote:
>
>> Hi Wenchen,
>>
>> thanks for your email. I agree adding doc for decimal type, but I am not
>> sure what you mean speaking of the behavior when writing: we are not
>> performing any automatic casting before writing; if we want to do that, we
>> need a design about it I think.
>>
>> I am not sure if it makes sense to set a min for it. That would break
>> backward compatibility (for very weird use case), so I wouldn't do that.
>>
>> Thanks,
>> Marco
>>
>> On Mon, Jan 7, 2019 at 05:53 Wenchen Fan
>> wrote:
>>
>>> I think we need to do this for backward compatibility, and according to
>>> the discussion in the doc, SQL standard allows negative scale.
>>>
>>> To do this, I think the PR should also include a doc for the decimal
>>> type, like the definition of precision and scale(this one
>>> <https://stackoverflow.com/questions/35435691/bigdecimal-precision-and-scale>
>>> looks pretty good), and the result type of decimal operations, and the
>>> behavior when writing out decimals(e.g. we can cast decimal(1, -20) to
>>> decimal(20, 0) before writing).
>>>
>>> Another question is, shall we set a min scale? e.g. shall we allow
>>> decimal(1, -1000)?
>>>
>>> On Thu, Oct 25, 2018 at 9:49 PM Marco Gaido 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> a bit more than one month ago, I sent a proposal for handling properly
>>>> decimals with negative scales in our operations. This is a long standing
>>>> problem in our codebase as we derived our rules from Hive and SQLServer
>>>> where negative scales are forbidden, while in Spark they are not.
>>>>
>>>> The discussion has been stale for a while now. No more comments on the
>>>> design doc:
>>>> https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit#heading=h.x7062zmkubwm
>>>> .
>>>>
>>>> So I am writing this e-mail in order to check whether there are more
>>>> comments on it or we can go ahead with the PR.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>


Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-07 Thread Marco Gaido
Hi Wenchen,

thanks for your email. I agree adding doc for decimal type, but I am not
sure what you mean speaking of the behavior when writing: we are not
performing any automatic casting before writing; if we want to do that, we
need a design about it I think.

I am not sure if it makes sense to set a min for it. That would break
backward compatibility (for very weird use case), so I wouldn't do that.

Thanks,
Marco

On Mon, Jan 7, 2019 at 05:53 Wenchen Fan
wrote:

> I think we need to do this for backward compatibility, and according to
> the discussion in the doc, SQL standard allows negative scale.
>
> To do this, I think the PR should also include a doc for the decimal type,
> like the definition of precision and scale(this one
> <https://stackoverflow.com/questions/35435691/bigdecimal-precision-and-scale>
> looks pretty good), and the result type of decimal operations, and the
> behavior when writing out decimals(e.g. we can cast decimal(1, -20) to
> decimal(20, 0) before writing).
>
> Another question is, shall we set a min scale? e.g. shall we allow
> decimal(1, -1000)?
>
> On Thu, Oct 25, 2018 at 9:49 PM Marco Gaido 
> wrote:
>
>> Hi all,
>>
>> a bit more than one month ago, I sent a proposal for handling properly
>> decimals with negative scales in our operations. This is a long standing
>> problem in our codebase as we derived our rules from Hive and SQLServer
>> where negative scales are forbidden, while in Spark they are not.
>>
>> The discussion has been stale for a while now. No more comments on the
>> design doc:
>> https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit#heading=h.x7062zmkubwm
>> .
>>
>> So I am writing this e-mail in order to check whether there are more
>> comments on it or we can go ahead with the PR.
>>
>> Thanks,
>> Marco
>>
>


Re: Decimals with negative scale

2018-12-19 Thread Marco Gaido
That is feasible; the main point is that negative scales were not really
meant to be there in the first place, so it is something which was simply
never forbidden, and it is something which the DBs we are drawing our
inspiration from for decimals (mainly SQLServer) do not support.
Honestly, my opinion on this topic is:
 - let's add the support to negative scales in the operations (I have
already a PR out for that, https://github.com/apache/spark/pull/22450);
 - let's reduce our usage of DECIMAL in favor of DOUBLE when parsing
literals, as done by Hive, Presto, DB2, ...; so the number of cases when we
deal with negative scales is anyway small (and we do not have issues with
datasources which don't support them).

Thanks,
Marco


On Tue, Dec 18, 2018 at 19:08 Reynold Xin
wrote:

> So why can't we just do validation to fail sources that don't support
> negative scale, if it is not supported? This way, we don't need to break
> backward compatibility in any way and it becomes a strict improvement.
>
>
> On Tue, Dec 18, 2018 at 8:43 AM, Marco Gaido 
> wrote:
>
>> This is at analysis time.
>>
>> On Tue, 18 Dec 2018, 17:32 Reynold Xin wrote:
>>> Is this an analysis time thing or a runtime thing?
>>>
>>> On Tue, Dec 18, 2018 at 7:45 AM Marco Gaido 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> as you may remember, there was a design doc to support operations
>>>> involving decimals with negative scales. After the discussion in the design
>>>> doc, now the related PR is blocked because for 3.0 we have another option
>>>> which we can explore, ie. forbidding negative scales. This is probably a
>>>> cleaner solution, as most likely we didn't want negative scales, but it is
>>>> a breaking change: so we wanted to check the opinion of the community.
>>>>
>>>> Getting to the topic, here there are the 2 options:
>>>> * - Forbidding negative scales*
>>>>   Pros: many sources do not support negative scales (so they can create
>>>> issues); they were something which was not considered as possible in the
>>>> initial implementation, so we get to a more stable situation.
>>>>   Cons: some operations which were supported earlier, won't be working
>>>> anymore. Eg. since our max precision is 38, if the scale cannot be negative
>>>> 1e36 * 1e36 would cause an overflow, while now works fine (producing a
>>>> decimal with negative scale); basically impossible to create a config which
>>>> controls the behavior.
>>>>
>>>>  *- Handling negative scales in operations*
>>>>   Pros: no regressions; we support all the operations we supported on
>>>> 2.x.
>>>>   Cons: negative scales can cause issues in other moments, eg. when
>>>> saving to a data source which doesn't support them.
>>>>
>>>> Looking forward to hear your thoughts,
>>>> Thanks.
>>>> Marco
>>>>
>>>
>


Re: Decimals with negative scale

2018-12-18 Thread Marco Gaido
This is at analysis time.

On Tue, 18 Dec 2018, 17:32 Reynold Xin wrote:
> Is this an analysis time thing or a runtime thing?
>
> On Tue, Dec 18, 2018 at 7:45 AM Marco Gaido 
> wrote:
>
>> Hi all,
>>
>> as you may remember, there was a design doc to support operations
>> involving decimals with negative scales. After the discussion in the design
>> doc, now the related PR is blocked because for 3.0 we have another option
>> which we can explore, ie. forbidding negative scales. This is probably a
>> cleaner solution, as most likely we didn't want negative scales, but it is
>> a breaking change: so we wanted to check the opinion of the community.
>>
>> Getting to the topic, here there are the 2 options:
>> * - Forbidding negative scales*
>>   Pros: many sources do not support negative scales (so they can create
>> issues); they were something which was not considered as possible in the
>> initial implementation, so we get to a more stable situation.
>>   Cons: some operations which were supported earlier, won't be working
>> anymore. Eg. since our max precision is 38, if the scale cannot be negative
>> 1e36 * 1e36 would cause an overflow, while now works fine (producing a
>> decimal with negative scale); basically impossible to create a config which
>> controls the behavior.
>>
>>  *- Handling negative scales in operations*
>>   Pros: no regressions; we support all the operations we supported on 2.x.
>>   Cons: negative scales can cause issues in other moments, eg. when
>> saving to a data source which doesn't support them.
>>
>> Looking forward to hear your thoughts,
>> Thanks.
>> Marco
>>
>>
>>


Decimals with negative scale

2018-12-18 Thread Marco Gaido
Hi all,

as you may remember, there was a design doc to support operations involving
decimals with negative scales. After the discussion in the design doc, now
the related PR is blocked because for 3.0 we have another option which we
can explore, ie. forbidding negative scales. This is probably a cleaner
solution, as most likely we didn't want negative scales, but it is a
breaking change: so we wanted to check the opinion of the community.

Getting to the topic, here are the 2 options:
* - Forbidding negative scales*
  Pros: many sources do not support negative scales (so they can create
issues); they were something which was not considered as possible in the
initial implementation, so we get to a more stable situation.
  Cons: some operations which were supported earlier won't be working
anymore. E.g., since our max precision is 38, if the scale cannot be negative,
1e36 * 1e36 would cause an overflow, while it now works fine (producing a
decimal with negative scale; see the sketch after this list); it is basically
impossible to create a config which controls the behavior.

 *- Handling negative scales in operations*
  Pros: no regressions; we support all the operations we supported on 2.x.
  Cons: negative scales can cause issues at other points, e.g. when saving
to a data source which doesn't support them.
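
To make the 1e36 example above concrete, a hedged sketch (the exact types are
approximate and assume the current 2.x behaviour described in this mail):

import org.apache.spark.sql.functions.lit

// BigDecimal("1e36") has precision 1 and scale -36, so it fits in a 38-digit
// decimal only because the scale is negative.
val big = lit(BigDecimal("1e36"))   // roughly DecimalType(1, -36)
val product = big * big             // multiply rule: precision p1+p2+1, scale s1+s2
spark.range(1).select(product).printSchema()
// With negative scales allowed, the result is roughly decimal(3, -72).
// If the scale were clamped to 0, the same value would need ~73 digits of
// precision, far beyond the maximum of 38 -- the overflow mentioned above.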

Looking forward to hearing your thoughts,
Thanks.
Marco


Re: Self join

2018-12-13 Thread Marco Gaido
Hi Ryan,

My goal with this email thread is to discuss with the community if there
are better ideas (as I was told many other people tried to address this).
I'd consider this as a brainstorming email thread. Once we have a good
proposal, then we can go ahead with a SPIP.

Thanks,
Marco

On Wed, Dec 12, 2018 at 19:13 Ryan Blue
wrote:

> Marco,
>
> I'm actually asking for a design doc that clearly states the problem and
> proposes a solution. This is a substantial change and probably should be an
> SPIP.
>
> I think that would be more likely to generate discussion than referring to
> PRs or a quick paragraph on the dev list, because the only people that are
> looking at it now are the ones already familiar with the problem.
>
> rb
>
> On Wed, Dec 12, 2018 at 2:05 AM Marco Gaido 
> wrote:
>
>> Thank you all for your answers.
>>
>> @Ryan Blue  sure, let me state the problem more
>> clearly: imagine you have 2 dataframes with a common lineage (for instance
>> one is derived from the other by some filtering or anything you prefer).
>> And imagine you want to join these 2 dataframes. Currently, there is a fix
>> by Reynold which deduplicates the join condition in case the condition is
>> an equality one (please notice that in this case, it doesn't matter which
>> one is on the left and which one on the right). But if the condition
>> involves other comparisons, such as a ">" or a "<", this would result in an
>> analysis error, because the attributes on both sides are the same (eg. you
>> have the same id#3 attribute on both sides), and you cannot deduplicate
>> them blindly as which one is on a specific side matters.
>>
>> @Reynold Xin  my proposal was to add a dataset id
>> in the metadata of each attribute, so that in this case we can distinguish
>> from which dataframe the attribute is coming from, ie. having the
>> DataFrames `df1` and `df2` where `df2` is derived from `df1`,
>> `df1.join(df2, df1("a") > df2("a"))` could be resolved because we would
>> know that the first attribute is taken from `df1` and so it has to be
>> resolved using it and the same for the other. But I am open to any approach
>> to this problem, if other people have better ideas/suggestions.
>>
>> Thanks,
>> Marco
>>
>> On Tue, Dec 11, 2018 at 18:31 Jörn Franke <
>> jornfra...@gmail.com> wrote:
>>
>>> I don’t know your exact underlying business problem,  but maybe a graph
>>> solution, such as Spark Graphx meets better your requirements. Usually
>>> self-joins are done to address some kind of graph problem (even if you
>>> would not describe it as such) and is for these kind of problems much more
>>> efficient.
>>>
>>> On 11.12.2018 at 12:44, Marco Gaido wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to bring to the attention of a more people a problem which has
>>> been there for long, ie, self joins. Currently, we have many troubles with
>>> them. This has been reported several times to the community and seems to
>>> affect many people, but as of now no solution has been accepted for it.
>>>
>>> I created a PR some time ago in order to address the problem (
>>> https://github.com/apache/spark/pull/21449), but Wenchen mentioned he
>>> tried to fix this problem too but so far no attempt was successful because
>>> there is no clear semantic (
>>> https://github.com/apache/spark/pull/21449#issuecomment-393554552).
>>>
>>> So I'd like to propose to discuss here which is the best approach for
>>> tackling this issue, which I think would be great to fix for 3.0.0, so if
>>> we decide to introduce breaking changes in the design, we can do that.
>>>
>>> Thoughts on this?
>>>
>>> Thanks,
>>> Marco
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Self join

2018-12-12 Thread Marco Gaido
Thank you all for your answers.

@Ryan Blue  sure, let me state the problem more clearly:
imagine you have 2 dataframes with a common lineage (for instance one is
derived from the other by some filtering or anything you prefer). And
imagine you want to join these 2 dataframes. Currently, there is a fix by
Reynold which deduplicates the join condition in case the condition is an
equality one (please notice that in this case, it doesn't matter which one
is on the left and which one on the right). But if the condition involves
other comparisons, such as a ">" or a "<", this would result in an analysis
error, because the attributes on both sides are the same (eg. you have the
same id#3 attribute on both sides), and you cannot deduplicate them blindly
as which one is on a specific side matters.
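
A hedged sketch of the failure mode just described (spark-shell; behaviour as
described in this thread for the 2.x line):

val df1 = spark.range(10).toDF("a")
val df2 = df1.filter("a > 2")   // df2 shares df1's lineage

// Equality conditions are deduplicated by the existing fix, so this works:
df1.join(df2, df1("a") === df2("a")).count()

// A non-equality condition cannot be deduplicated blindly, because df1("a") and
// df2("a") resolve to the same underlying attribute (e.g. a#0L), so per the
// description above this fails analysis:
df1.join(df2, df1("a") > df2("a")).count()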

@Reynold Xin  my proposal was to add a dataset id in
the metadata of each attribute, so that in this case we can distinguish
from which dataframe the attribute is coming from, ie. having the
DataFrames `df1` and `df2` where `df2` is derived from `df1`,
`df1.join(df2, df1("a") > df2("a"))` could be resolved because we would
know that the first attribute is taken from `df1` and so it has to be
resolved using it and the same for the other. But I am open to any approach
to this problem, if other people have better ideas/suggestions.
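
Purely as an illustration of that idea (not Spark's actual implementation; the
helper and metadata key below are hypothetical), a sketch of tagging a column
with the id of the Dataset that produced it:

import org.apache.spark.sql.{Column, Dataset}
import org.apache.spark.sql.types.MetadataBuilder

// Hypothetical helper: record an identifier of the originating Dataset in the
// column metadata, so an analyzer rule could later tell df1("a") and df2("a")
// apart even though they share the same underlying attribute.
def tagWithDatasetId(ds: Dataset[_], colName: String, datasetId: Long): Column = {
  val meta = new MetadataBuilder().putLong("__dataset_id", datasetId).build()
  ds(colName).as(colName, meta)   // Column.as(alias, metadata) attaches the metadata
}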

Thanks,
Marco

On Tue, Dec 11, 2018 at 18:31 Jörn Franke
wrote:

> I don't know your exact underlying business problem, but maybe a graph
> solution, such as Spark GraphX, would better meet your requirements.
> Usually self-joins are done to address some kind of graph problem (even if
> you would not describe it as such), and a graph approach is much more
> efficient for these kinds of problems.
>
> On Dec 11, 2018, at 12:44, Marco Gaido  wrote:
>
> Hi all,
>
> I'd like to bring to the attention of more people a problem which has
> been there for a long time, i.e., self joins. Currently, we have a lot of
> trouble with them. This has been reported several times to the community
> and seems to affect many people, but as of now no solution has been
> accepted for it.
>
> I created a PR some time ago in order to address the problem (
> https://github.com/apache/spark/pull/21449), but Wenchen mentioned he
> tried to fix this problem too, but so far no attempt has been successful
> because there are no clear semantics (
> https://github.com/apache/spark/pull/21449#issuecomment-393554552).
>
> So I'd like to propose discussing here what the best approach is for
> tackling this issue, which I think would be great to fix for 3.0.0, so that
> if we decide to introduce breaking changes in the design, we can do that.
>
> Thoughts on this?
>
> Thanks,
> Marco
>
>


Self join

2018-12-11 Thread Marco Gaido
Hi all,

I'd like to bring to the attention of more people a problem which has been
there for a long time, i.e., self joins. Currently, we have a lot of trouble
with them. This has been reported several times to the community and seems to
affect many people, but as of now no solution has been accepted for it.

I created a PR some time ago in order to address the problem (
https://github.com/apache/spark/pull/21449), but Wenchen mentioned he tried
to fix this problem too, yet so far no attempt has been successful because
there are no clear semantics (
https://github.com/apache/spark/pull/21449#issuecomment-393554552).

So I'd like to propose discussing here what the best approach is for
tackling this issue, which I think would be great to fix for 3.0.0, so that
if we decide to introduce breaking changes in the design, we can do that.

Thoughts on this?

Thanks,
Marco


Re: Jenkins down?

2018-11-19 Thread Marco Gaido
Thanks Shane!

On Mon, Nov 19, 2018 at 19:14, shane knapp  wrote:

> alright, we're back and building.
>
> On Mon, Nov 19, 2018 at 10:11 AM shane knapp  wrote:
>
>> thanks for the heads up...  looks like the backup process got wedged.
>> i'll restart jenkins now.
>>
>> On Mon, Nov 19, 2018 at 9:56 AM Sean Owen  wrote:
>>
>>> Jenkins says it's shutting down; I assume shane needs to cycle it.
>>> Note also that the Apache - github sync looks like it is stuck;
>>> nothing has synced since yesterday:
>>> https://github.com/apache/spark/commits/master
>>> That might also be a factor in whatever you're observing.
>>> On Mon, Nov 19, 2018 at 10:53 AM Marco Gaido 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I see that Jenkins is not starting builds for the PRs today. Is it in
>>> maintenance?
>>> >
>>> > Thanks,
>>> > Marco
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Jenkins down?

2018-11-19 Thread Marco Gaido
Hi all,

I see that Jenkins is not starting builds for the PRs today. Is it in
maintenance?

Thanks,
Marco


Re: Is spark.sql.codegen.factoryMode property really for tests only?

2018-11-16 Thread Marco Gaido
Hi Jacek,

I do believe it is correct. Please check the method you mentioned
(CodeGeneratorWithInterpretedFallback.createObject): the value is relevant
only if Utils.isTesting.
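
In case it helps, a self-contained sketch of the pattern (stand-in names, not
the actual Spark classes): the configured mode only takes effect when running
under tests; otherwise the behavior is always "try codegen, fall back to the
interpreted path".

    import scala.util.control.NonFatal

    object FactoryModeSketch {
      sealed trait Mode
      case object Fallback    extends Mode  // the default
      case object CodegenOnly extends Mode
      case object NoCodegen   extends Mode

      // stand-in for Utils.isTesting
      def isTesting: Boolean =
        sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")

      def create[T](mode: Mode, codegen: () => T, interpreted: () => T): T =
        mode match {
          case CodegenOnly if isTesting => codegen()
          case NoCodegen if isTesting   => interpreted()
          case _ =>
            // production path: the configured mode is effectively ignored here
            try codegen() catch { case NonFatal(_) => interpreted() }
        }
    }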

Thanks,
Marco

On Fri, Nov 16, 2018 at 13:28, Jacek Laskowski  wrote:

> Hi,
>
> While reviewing the changes in 2.4 I stumbled
> upon spark.sql.codegen.factoryMode internal configuration property [1]. The
> doc says:
>
> > Note that this config works only for tests.
>
> Is that correct? I've got some doubts.
>
> I found that it's used in UnsafeProjection.create [2] (through
> CodeGeneratorWithInterpretedFallback.createObject) which is used outside
> the tests; this made me wonder whether the "this config works only for
> tests" part is correct.
>
> Are my doubts correct? If not, what am I missing? Thanks.
>
> [1]
> https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L758-L767
>
> [2]
> https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L159
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>


Re: Drop support for old Hive in Spark 3.0?

2018-10-26 Thread Marco Gaido
Hi all,

one big problem about getting rid of the Hive fork is the thriftserver,
which relies on the HiveServer from the Hive fork.
We might migrate to an apache/hive dependency, but I am not sure this would
help that much.
I think a broader topic would be whether it is worth having a thriftserver
directly in Spark. It has many well-known limitations (not fault tolerant, no
security/impersonation, etc.) and there are other projects which aim to
provide a thrift/JDBC interface to Spark. Just to be clear, I am not
proposing to remove the thriftserver in 3.0, but maybe it is something we
could evaluate in the long term.

Thanks,
Marco


On Fri, Oct 26, 2018 at 19:07, Sean Owen  wrote:

> OK let's keep this about Hive.
>
> Right, good point, this is really about supporting metastore versions, and
> there is a good argument for retaining backwards-compatibility with older
> metastores. I don't know how far, but I guess, as far as is practical?
>
> Isn't there still a lot of Hive 0.x test code? is that something that's
> safe to drop for 3.0?
>
> And, basically, what must we do to get rid of the Hive fork? that seems
> like a must-do.
>
>
>
> On Fri, Oct 26, 2018 at 11:51 AM Dongjoon Hyun 
> wrote:
>
>> Hi, Sean and All.
>>
>> For the first question, we support only Hive Metastore from 1.x ~ 2.x.
>> And, we can support Hive Metastore 3.0 simultaneously. Spark is designed
>> like that.
>>
>> I don't think we need to drop old Hive Metastore Support. Is it
>> for avoiding Hive Metastore sharing between Spark2 and Spark3 clusters?
>>
>> I think we should allow those use cases, especially for new Spark 3
>> clusters. What do you think?
>>
>>
>> For the second question, Apache Spark 2.x doesn't support Hive
>> officially. It's only a best-effort approach within the boundaries of Spark.
>>
>>
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality
>>
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#incompatible-hive-udf
>>
>>
>> Beyond the documented ones, the decimal literal change (HIVE-17186) causes a
>> query result difference even in a well-known benchmark like TPC-H.
>>
>> Bests,
>> Dongjoon.
>>
>> PS. For Hadoop, let's have another thread if needed. I expect another
>> long story. :)
>>
>>
>> On Fri, Oct 26, 2018 at 7:11 AM Sean Owen  wrote:
>>
>>> Here's another thread to start considering, and I know it's been raised
>>> before.
>>> What version(s) of Hive should Spark 3 support?
>>>
>>> If at least we know it won't include Hive 0.x, could we go ahead and
>>> remove those tests from master? It might significantly reduce the run time
>>> and flakiness.
>>>
>>> It seems that maintaining even the Hive 1.x fork is untenable going
>>> forward, right? does that also imply this support is almost certainly not
>>> maintained in 3.0?
>>>
>>> Per below, it seems like it might even be hard to both support Hive 3
>>> and Hadoop 2 at the same time?
>>>
>>> And while we're at it, what's the + and - for simply only supporting
>>> Hadoop 3 in Spark 3? Is the difference in client / HDFS API even that big?
>>> Or what about focusing only on Hadoop 2.9.x support + 3.x support?
>>>
>>> Lots of questions, just interested now in informal reactions, not a
>>> binding decision.
>>>
>>> On Thu, Oct 25, 2018 at 11:49 PM Dagang Wei 
>>> wrote:
>>>
 Do we really want to switch to Hive 2.3? From this page
 https://hive.apache.org/downloads.html, Hive 2.3 works with Hadoop 2.x
 (Hive 3.x works with Hadoop 3.x).

 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 ,
 or mute the thread
 
 .

>>>


[DISCUSS] Support decimals with negative scale in decimal operation

2018-10-25 Thread Marco Gaido
Hi all,

a bit more than one month ago, I sent a proposal for properly handling
decimals with negative scales in our operations. This is a long-standing
problem in our codebase, as we derived our rules from Hive and SQL Server,
where negative scales are forbidden, while in Spark they are not.

The discussion has been stale for a while now. No more comments on the
design doc:
https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit#heading=h.x7062zmkubwm
.

So I am writing this e-mail in order to check whether there are more
comments on it or we can go ahead with the PR.

Thanks,
Marco


Re: Starting to make changes for Spark 3 -- what can we delete?

2018-10-17 Thread Marco Gaido
Hi all,

I think a very big topic on this would be: what do we want to do with the
old mllib API? For a long time I have been told that it was going to be
removed in 3.0. Is this still the plan?

Thanks,
Marco

On Wed, Oct 17, 2018 at 03:11, Marcelo Vanzin  wrote:

> Might be good to take a look at things marked "@DeveloperApi" and
> whether they should stay that way.
>
> e.g. I was looking at SparkHadoopUtil and I've always wanted to just
> make it private to Spark. I don't see why apps would need any of those
> methods.
> On Tue, Oct 16, 2018 at 10:18 AM Sean Owen  wrote:
> >
> > There was already agreement to delete deprecated things like Flume and
> > Kafka 0.8 support in master. I've got several more on my radar, and
> > wanted to highlight them and solicit general opinions on where we
> > should accept breaking changes.
> >
> > For example how about removing accumulator v1?
> > https://github.com/apache/spark/pull/22730
> >
> > Or using the standard Java Optional?
> > https://github.com/apache/spark/pull/22383
> >
> > Or cleaning up some old workarounds and APIs while at it?
> > https://github.com/apache/spark/pull/22729 (still in progress)
> >
> > I think I talked myself out of replacing Java function interfaces with
> > java.util.function because...
> > https://issues.apache.org/jira/browse/SPARK-25369
> >
> > There are also, say, old json and csv and avro reading method
> > deprecated since 1.4. Remove?
> > Anything deprecated since 2.0.0?
> >
> > Interested in general thoughts on these.
> >
> > Here are some more items targeted to 3.0:
> >
> https://issues.apache.org/jira/browse/SPARK-17875?jql=project%3D%22SPARK%22%20AND%20%22Target%20Version%2Fs%22%3D%223.0.0%22%20ORDER%20BY%20priority%20ASC
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Random sampling in tests

2018-10-08 Thread Marco Gaido
Yes, I see. It makes sense.
Thanks.

On Mon, Oct 8, 2018 at 16:35, Reynold Xin  wrote:

> Marco - the issue is reproducibility. It is much more annoying for somebody
> else who might not have touched this test case to reproduce the error given
> just a timezone. It is much easier to just follow some documentation saying
> "please run TEST_SEED=5 build/sbt ~ ".
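
Something along these lines, I guess (TEST_SEED is only an example name):

    // pick the seed from an env variable if present, otherwise randomize,
    // and always log it so a failing run can be reproduced deterministically
    val seed: Long = sys.env.get("TEST_SEED").map(_.toLong)
      .getOrElse(new scala.util.Random().nextLong())
    println(s"Random-sampling seed = $seed (re-run with TEST_SEED=$seed to reproduce)")
    val rng = new scala.util.Random(seed)
    val timeZonesToTest = rng.shuffle(java.util.TimeZone.getAvailableIDs.toSeq).take(5)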
>
>
> On Mon, Oct 8, 2018 at 4:33 PM Marco Gaido  wrote:
>
>> Hi all,
>>
>> thanks for bringing up the topic, Sean. I also agree with Reynold's idea,
>> but in this specific case, if there is an error, the timezone is part of
>> the error message.
>> So we know exactly which timezone caused the failure. Hence I thought
>> that logging the seed is not necessary, as we can directly use the failing
>> timezone.
>>
>> Thanks,
>> Marco
>>
>> On Mon, Oct 8, 2018 at 16:24, Xiao Li  wrote:
>>
>>> For this specific case, I do not think we should test all the timezones.
>>> If this were fast, I would be fine with leaving it unchanged. However, this
>>> is very slow. Thus, I would even prefer reducing the tested timezones to a
>>> smaller number or just hardcoding some specific time zones.
>>>
>>> In general, I like Reynold's idea of including the seed value and adding
>>> the seed to the test case name. This can help us reproduce it.
>>>
>>> Xiao
>>>
>>> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>>>
>>>> I'm personally not a big fan of doing it that way in the PR. It is
>>>> perfectly fine to employ randomized tests, and in this case it might even
>>>> be fine to just pick a couple of different timezones like the way it happened in
>>>> the PR, but we should:
>>>>
>>>> 1. Document in the code comment why we did it that way.
>>>>
>>>> 2. Use a seed and log the seed, so any test failures can be reproduced
>>>> deterministically. For this one, it'd be better to pick the seed from a
>>>> seed environment variable. If the env variable is not set, use a random
>>>> seed.
>>>>
>>>>
>>>>
>>>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>>>
>>>>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>>>> that tests a bunch of cases by randomly choosing different subsets of
>>>>> cases to test on each Jenkins run.
>>>>>
>>>>> There's disagreement about whether this is a good approach to improving
>>>>> test runtime. Here's a discussion on one that was committed:
>>>>> https://github.com/apache/spark/pull/22631/files#r223190476
>>>>>
>>>>> I'm flagging it for more input.
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>


Re: Random sampling in tests

2018-10-08 Thread Marco Gaido
Hi all,

thanks for bringing up the topic, Sean. I also agree with Reynold's idea,
but in this specific case, if there is an error, the timezone is part of the
error message.
So we know exactly which timezone caused the failure. Hence I thought that
logging the seed is not necessary, as we can directly use the failing
timezone.

Thanks,
Marco

On Mon, Oct 8, 2018 at 16:24, Xiao Li  wrote:

> For this specific case, I do not think we should test all the timezones. If
> this were fast, I would be fine with leaving it unchanged. However, this is
> very slow. Thus, I would even prefer reducing the tested timezones to a
> smaller number or just hardcoding some specific time zones.
>
> In general, I like Reynold's idea of including the seed value and adding the
> seed to the test case name. This can help us reproduce it.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>
>> I'm personally not a big fan of doing it that way in the PR. It is
>> perfectly fine to employ randomized tests, and in this case it might even
>> be fine to just pick a couple of different timezones like the way it happened in
>> the PR, but we should:
>>
>> 1. Document in the code comment why we did it that way.
>>
>> 2. Use a seed and log the seed, so any test failures can be reproduced
>> deterministically. For this one, it'd be better to pick the seed from a
>> seed environment variable. If the env variable is not set, use a random
>> seed.
>>
>>
>>
>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>
>>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>> that tests a bunch of cases by randomly choosing different subsets of
>>> cases to test on each Jenkins run.
>>>
>>> There's disagreement about whether this is a good approach to improving
>>> test runtime. Here's a discussion on one that was committed:
>>> https://github.com/apache/spark/pull/22631/files#r223190476
>>>
>>> I'm flagging it for more input.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: welcome a new batch of committers

2018-10-03 Thread Marco Gaido
Congrats you all!

On Wed, Oct 3, 2018 at 11:29, Liang-Chi Hsieh  wrote:

>
> Congratulations to all new committers!
>
>
> rxin wrote
> > Hi all,
> >
> > The Apache Spark PMC has recently voted to add several new committers to
> > the project, for their contributions:
> >
> > - Shane Knapp (contributor to infra)
> > - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
> > - Kazuaki Ishizaki (contributor to Spark SQL)
> > - Xingbo Jiang (contributor to Spark Core and SQL)
> > - Yinan Li (contributor to Spark on Kubernetes)
> > - Takeshi Yamamuro (contributor to Spark SQL)
> >
> > Please join me in welcoming them!
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-01 Thread Marco Gaido
-1, I was able to reproduce SPARK-25538 with the provided data.

On Mon, Oct 1, 2018 at 09:11, Ted Yu  wrote:

> +1
>
>  Original message 
> From: Denny Lee 
> Date: 9/30/18 10:30 PM (GMT-08:00)
> To: Stavros Kontopoulos 
> Cc: Sean Owen , Wenchen Fan , dev <
> dev@spark.apache.org>
> Subject: Re: [VOTE] SPARK 2.4.0 (RC2)
>
> +1 (non-binding)
>
>
> On Sat, Sep 29, 2018 at 10:24 AM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> +1
>>
>> Stavros
>>
>> On Sat, Sep 29, 2018 at 5:59 AM, Sean Owen  wrote:
>>
>>> +1, with comments:
>>>
>>> There are 5 critical issues for 2.4, and no blockers:
>>> SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
>>> SPARK-25325 ML, Graph 2.4 QA: Update user guide for new features & APIs
>>> SPARK-25319 Spark MLlib, GraphX 2.4 QA umbrella
>>> SPARK-25326 ML, Graph 2.4 QA: Programming guide update and migration
>>> guide
>>> SPARK-25323 ML 2.4 QA: API: Python API coverage
>>>
>>> Xiangrui, is SPARK-25378 important enough we need to get it into 2.4?
>>>
>>> I found two issues resolved for 2.4.1 that got into this RC, so marked
>>> them as resolved in 2.4.0.
>>>
>>> I checked the licenses and notice and they look correct now in source
>>> and binary builds.
>>>
>>> The 2.12 artifacts are as I'd expect.
>>>
>>> I ran all tests for 2.11 and 2.12 and they pass with -Pyarn
>>> -Pkubernetes -Pmesos -Phive -Phadoop-2.7 -Pscala-2.12.
>>>
>>>
>>>
>>>
>>> On Thu, Sep 27, 2018 at 10:00 PM Wenchen Fan 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.0.
>>> >
>>> > The vote is open until October 1 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> > a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 2.4.0
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v2.4.0-rc2 (commit
>>> 42f25f309e91c8cde1814e3720099ac1e64783da):
>>> > https://github.com/apache/spark/tree/v2.4.0-rc2
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc2-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> > https://repository.apache.org/content/repositories/orgapachespark-1287
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc2-docs/
>>> >
>>> > The list of bug fixes going into 2.4.0 can be found at the following
>>> URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks, in the Java/Scala
>>> > you can add the staging repository to your projects resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with a out of date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 2.4.0?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 2.4.0 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.4.0
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility should
>>> > be worked on immediately. Everything else please retarget to an
>>> > appropriate release.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the previous
>>> > release. That being said, if there is something which is a regression
>>> > that has not been correctly targeted please ping me or a committer to
>>> > help target the issue.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>


Re: SPIP: support decimals with negative scale in decimal operation

2018-09-21 Thread Marco Gaido
Hi Wenchen,
Thank you for the clarification. I agree that this is more a bug fix than
an improvement. I apologize for the error. Please consider this as a
design doc.

Thanks,
Marco

On Fri, Sep 21, 2018 at 12:04, Wenchen Fan  wrote:

> Hi Marco,
>
> Thanks for sending it! The problem is clearly explained in this email, but
> I would not treat it as a SPIP. It proposes a fix for a very tricky bug,
> and SPIP is usually for new features. Others, please correct me if I am
> wrong.
>
> Thanks,
> Wenchen
>
> On Fri, Sep 21, 2018 at 5:47 PM Marco Gaido 
> wrote:
>
>> Hi all,
>>
>> I am writing this e-mail in order to discuss the issue which is reported
>> in SPARK-25454 and according to Wenchen's suggestion I prepared a design
>> doc for it.
>>
>> The problem we are facing here is that our rules for decimal operations
>> are taken from Hive and MS SQL Server, which explicitly forbid decimals
>> with negative scales. So the rules we have currently are not meant to deal
>> with negative scales. The issue is that Spark, instead, doesn't forbid
>> negative scales and - indeed - there are cases in which we produce them
>> (e.g. a SQL constant like 1e8 is turned into a decimal(1, -8)).
>>
>> Having negative scales most likely wasn't really intended. But
>> unfortunately getting rid of them would be a breaking change as many
>> operations working fine currently would not be allowed anymore and would
>> overflow (e.g. select 1e36 * 1). As such, this is something I'd
>> definitely agree on doing, but I think we can target it only for 3.0.
>>
>> What we can start doing now, instead, is updating our rules in order to
>> also handle properly the case when decimal scales are negative. From my
>> investigation, it turns out that the only operation which has problems
>> with them is Divide.
>>
>> Here you can find the design doc with all the details:
>> https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit?usp=sharing.
>> The document is also linked in SPARK-25454. There is also already a PR with
>> the change: https://github.com/apache/spark/pull/22450.
>>
>> Looking forward to hearing your feedback,
>> Thanks.
>> Marco
>>
>


SPIP: support decimals with negative scale in decimal operation

2018-09-21 Thread Marco Gaido
Hi all,

I am writing this e-mail in order to discuss the issue which is reported in
SPARK-25454 and according to Wenchen's suggestion I prepared a design doc
for it.

The problem we are facing here is that our rules for decimal operations
are taken from Hive and MS SQL Server, which explicitly forbid decimals with
negative scales. So the rules we have currently are not meant to deal with
negative scales. The issue is that Spark, instead, doesn't forbid negative
scales and - indeed - there are cases in which we produce them (e.g. a SQL
constant like 1e8 is turned into a decimal(1, -8)).

Having negative scales most likely wasn't really intended. But
unfortunately getting rid of them would be a breaking change as many
operations working fine currently would not be allowed anymore and would
overflow (e.g. select 1e36 * 1). As such, this is something I'd
definitely agree on doing, but I think we can target it only for 3.0.
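
To make the two examples above concrete (in spark-shell, current behavior):

    spark.sql("SELECT 1e8").schema
    // the literal is typed as DecimalType(1, -8), i.e. a negative scale

    spark.sql("SELECT 1e36 * 1").show()
    // works today, but would overflow if negative scales were simply forbidden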

What we can start doing now, instead, is updating our rules in order to
also handle properly the case when decimal scales are negative. From my
investigation, it turns out that the only operation which has problems with
them is Divide.

Here you can find the design doc with all the details:
https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit?usp=sharing.
The document is also linked in SPARK-25454. There is also already a PR with
the change: https://github.com/apache/spark/pull/22450.

Looking forward to hearing your feedback,
Thanks.
Marco


Re: Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Marco Gaido
It is not new, it has been there since 2.3.0, so in that case this is not a
blocker. Thanks.

On Wed, Sep 19, 2018 at 09:21, Reynold Xin  wrote:

> We also only block if it is a new regression.
>
> On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao 
> wrote:
>
>> Hi Marco,
>>
>> From my understanding of SPARK-25454, I don't think it is a blocker issue;
>> it might be a corner case, so personally I don't want to block the release
>> of 2.3.2 because of this issue. The release has been delayed for a long
>> time.
>>
>> Marco Gaido  wrote on Wed, Sep 19, 2018 at 2:58 PM:
>>
>>> Sorry, I am -1 because of SPARK-25454 which is a regression from 2.2.
>>>
>>> On Wed, Sep 19, 2018 at 03:45, Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
>>>> +1.
>>>>
>>>> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>>>> -Phive-thriftserver` on OpenJDK(1.8.0_181)/CentOS 7.5.
>>>>
>>>> I hit the following test case failure once during testing, but it's not
>>>> persistent.
>>>>
>>>> KafkaContinuousSourceSuite
>>>> ...
>>>> subscribing topic by name from earliest offsets (failOnDataLoss:
>>>> false) *** FAILED ***
>>>>
>>>> Thank you, Saisai.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
>>>> wrote:
>>>>
>>>>> +1 from my own side.
>>>>>
>>>>> Thanks
>>>>> Saisai
>>>>>
>>>>>> Wenchen Fan  wrote on Tue, Sep 18, 2018 at 9:34 AM:
>>>>>
>>>>>> +1. All the blocker issues are all resolved in 2.3.2 AFAIK.
>>>>>>
>>>>>> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>>>>>>
>>>>>>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>>>>>>> build from source with most profiles passed for me.
>>>>>>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 2.3.2.
>>>>>>> >
>>>>>>> > The vote is open until September 21 PST and passes if a majority
>>>>>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>> >
>>>>>>> > [ ] +1 Release this package as Apache Spark 2.3.2
>>>>>>> > [ ] -1 Do not release this package because ...
>>>>>>> >
>>>>>>> > To learn more about Apache Spark, please see
>>>>>>> http://spark.apache.org/
>>>>>>> >
>>>>>>> > The tag to be voted on is v2.3.2-rc6 (commit
>>>>>>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>>>>>>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>>>>>>> >
>>>>>>> > The release files, including signatures, digests, etc. can be
>>>>>>> found at:
>>>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>>>>>>> >
>>>>>>> > Signatures used for Spark RCs can be found in this file:
>>>>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>> >
>>>>>>> > The staging repository for this release can be found at:
>>>>>>> >
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1286/
>>>>>>> >
>>>>>>> > The documentation corresponding to this release can be found at:
>>>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>>>>>>> >
>>>>>>> > The list of bug fixes going into 2.3.2 can be found at the
>>>>>>> following URL:
>>>>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>>>>>> >
>>>>>>> >
>>>>>>> > FAQ
>>>>>>> >
>>>>>>> > =
>>>>>>> > How can I help test this release?
>>>>>>> > =
>>>>>>> >

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Marco Gaido
Sorry, I am -1 because of SPARK-25454 which is a regression from 2.2.

On Wed, Sep 19, 2018 at 03:45, Dongjoon Hyun <
dongjoon.h...@gmail.com> wrote:

> +1.
>
> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserver` on OpenJDK(1.8.0_181)/CentOS 7.5.
>
> I hit the following test case failure once during testing, but it's not
> persistent.
>
> KafkaContinuousSourceSuite
> ...
> subscribing topic by name from earliest offsets (failOnDataLoss:
> false) *** FAILED ***
>
> Thank you, Saisai.
>
> Bests,
> Dongjoon.
>
> On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
> wrote:
>
>> +1 from my own side.
>>
>> Thanks
>> Saisai
>>
>> Wenchen Fan  wrote on Tue, Sep 18, 2018 at 9:34 AM:
>>
>>> +1. All the blocker issues are all resolved in 2.3.2 AFAIK.
>>>
>>> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>>>
 +1 . Licenses and sigs check out as in previous 2.3.x releases. A
 build from source with most profiles passed for me.
 On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
 wrote:
 >
 > Please vote on releasing the following candidate as Apache Spark
 version 2.3.2.
 >
 > The vote is open until September 21 PST and passes if a majority +1
 PMC votes are cast, with a minimum of 3 +1 votes.
 >
 > [ ] +1 Release this package as Apache Spark 2.3.2
 > [ ] -1 Do not release this package because ...
 >
 > To learn more about Apache Spark, please see http://spark.apache.org/
 >
 > The tag to be voted on is v2.3.2-rc6 (commit
 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
 > https://github.com/apache/spark/tree/v2.3.2-rc6
 >
 > The release files, including signatures, digests, etc. can be found
 at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
 >
 > Signatures used for Spark RCs can be found in this file:
 > https://dist.apache.org/repos/dist/dev/spark/KEYS
 >
 > The staging repository for this release can be found at:
 >
 https://repository.apache.org/content/repositories/orgapachespark-1286/
 >
 > The documentation corresponding to this release can be found at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
 >
 > The list of bug fixes going into 2.3.2 can be found at the following
 URL:
 > https://issues.apache.org/jira/projects/SPARK/versions/12343289
 >
 >
 > FAQ
 >
 > =
 > How can I help test this release?
 > =
 >
 > If you are a Spark user, you can help us test this release by taking
 > an existing Spark workload and running on this release candidate, then
 > reporting any regressions.
 >
 > If you're working in PySpark you can set up a virtual env and install
 > the current RC and see if anything important breaks, in the Java/Scala
 > you can add the staging repository to your projects resolvers and test
 > with the RC (make sure to clean up the artifact cache before/after so
 > you don't end up building with a out of date RC going forward).
 >
 > ===
 > What should happen to JIRA tickets still targeting 2.3.2?
 > ===
 >
 > The current list of open tickets targeted at 2.3.2 can be found at:
 > https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 2.3.2
 >
 > Committers should look at those and triage. Extremely important bug
 > fixes, documentation, and API tweaks that impact compatibility should
 > be worked on immediately. Everything else please retarget to an
 > appropriate release.
 >
 > ==
 > But my bug isn't fixed?
 > ==
 >
 > In order to make timely releases, we will typically not hold the
 > release unless the bug in question is a regression from the previous
 > release. That being said, if there is something which is a regression
 > that has not been correctly targeted please ping me or a committer to
 > help target the issue.

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-18 Thread Marco Gaido
Sorry but I am -1 because of what was reported here:
https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104
.
It is unfortunately a regression. Although the impact is not huge and there
are workarounds, I think we should include the fix in 2.4.0. I created
SPARK-25454 and submitted a PR for it.
Sorry for the trouble.

On Tue, Sep 18, 2018 at 05:23, Holden Karau  wrote:

> Deprecating Py 2 in the 2.4 release probably doesn't belong in the RC vote
> thread. Personally I think we might be a little too late in the game to
> deprecate it in 2.4, but I think calling it out as "soon to be deprecated"
> in the release docs would be sensible to give folks extra time to prepare.
>
> On Mon, Sep 17, 2018 at 2:04 PM Erik Erlandson 
> wrote:
>
>>
>> I have no binding vote but I second Stavros’ recommendation for
>> spark-23200
>>
>> Per parallel threads on Py2 support I would also like to propose
>> deprecating Py2 starting with this 2.4 release
>>
>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>  wrote:
>>
>>> You can log in to https://repository.apache.org and see what's wrong.
>>> Just find that staging repo and look at the messages. In your case it
>>> seems related to your signature.
>>>
>>> failureMessageNo public key: Key with id: () was not able to be
>>> located on http://gpg-keyserver.de/. Upload your public key and try
>>> the operation again.
>>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
>>> wrote:
>>> >
>>> > I confirmed that
>>> https://repository.apache.org/content/repositories/orgapachespark-1285
>>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see any
>>> error message during it.
>>> >
>>> > Any insights are appreciated! So that I can fix it in the next RC.
>>> Thanks!
>>> >
>>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>>> >>
>>> >> I think one build is enough, but haven't thought it through. The
>>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>>> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>>> >> Really, whatever's the easy thing to do.
>>> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>>> wrote:
>>> >> >
>>> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
>>> Scala 2.12 build this time? Current for Scala 2.11 we have 3 builds: with
>>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
>>> Scala 2.12?
>>> >> >
>>> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
>>> wrote:
>>> >> >>
>>> >> >> A few preliminary notes:
>>> >> >>
>>> >> >> Wenchen for some weird reason when I hit your key in gpg --import,
>>> it
>>> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
>>> verify
>>> >> >> the signature. No issue there really.
>>> >> >>
>>> >> >> The staging repo gives a 404:
>>> >> >>
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>>> >> >> [id=orgapachespark-1285] exists but is not exposed.
>>> >> >>
>>> >> >> The (revamped) licenses are OK, though there are some minor
>>> glitches
>>> >> >> in the final release tarballs (my fault) : there's an extra
>>> directory,
>>> >> >> and the source release has both binary and source licenses. I'll
>>> fix
>>> >> >> that. Not strictly necessary to reject the release over those.
>>> >> >>
>>> >> >> Last, when I check the staging repo I'll get my answer, but, were
>>> you
>>> >> >> able to build 2.12 artifacts as well?
>>> >> >>
>>> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>>> wrote:
>>> >> >> >
>>> >> >> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.0.
>>> >> >> >
>>> >> >> > The vote is open until September 20 PST and passes if a majority
>>> +1 PMC votes are cast, with
>>> >> >> > a minimum of 3 +1 votes.
>>> >> >> >
>>> >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>>> >> >> > [ ] -1 Do not release this package because ...
>>> >> >> >
>>> >> >> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >> >> >
>>> >> >> > The tag to be voted on is v2.4.0-rc1 (commit
>>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>>> >> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>>> >> >> >
>>> >> >> > The release files, including signatures, digests, etc. can be
>>> found at:
>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>>> >> >> >
>>> >> >> > Signatures used for Spark RCs can be found in this file:
>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >> >> >
>>> >> >> > The staging repository for this release can be found at:
>>> >> >> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> >> >
>>> >> >> > The documentation corresponding to this release can be 

Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-22 Thread Marco Gaido
I agree with Saisai. You can also configure log4j to append anywhere other
than the console. Many companies have their own systems for collecting and
monitoring logs, and they just customize the log4j configuration. I am not
sure how needed this change would be.
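
For example, something like the following (the file path is only an example,
and it is based on conf/log4j.properties.template) keeps the console appender
but also writes the driver log to a file; it can be passed to the driver with
--driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties":

    log4j.rootCategory=INFO, console, driverFile
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    log4j.appender.driverFile=org.apache.log4j.FileAppender
    log4j.appender.driverFile.File=/var/log/spark/driver.log
    log4j.appender.driverFile.layout=org.apache.log4j.PatternLayout
    log4j.appender.driverFile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n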

Thanks,
Marco

On Wed, Aug 22, 2018 at 04:31, Saisai Shao  wrote:

> One issue I can think of is that this "moving the driver log" at the
> application end is quite time-consuming, which will significantly delay the
> shutdown. We already suffered from such a "rename" problem for the event log
> on object stores; moving the driver log will make the problem more severe.
>
> For a vanilla Spark on yarn client application, I think the user could
> redirect the console output to a log file and provide both the driver log
> and the yarn application log to the customers; this seems not a big overhead.
>
> Just my two cents.
>
> Thanks
> Saisai
>
> Ankur Gupta  wrote on Wed, Aug 22, 2018 at 5:19 AM:
>
>> Hi all,
>>
>> I want to highlight a problem that we face here at Cloudera and start a
>> discussion on how to go about solving it.
>>
>> *Problem Statement:*
>> Our customers reach out to us when they face problems in their Spark
>> Applications. Those problems can be related to Spark, environment issues,
>> their own code or something else altogether. A lot of times these customers
>> run their Spark Applications in Yarn Client mode, which as we all know,
>> uses a ConsoleAppender to print logs to the console. These customers
>> usually send their Yarn logs to us to troubleshoot. As you may have
>> figured, these logs do not contain driver logs, which makes it difficult for
>> us to troubleshoot the issue. In that scenario our customers end up running
>> the application again, piping the output to a log file or using a local log
>> appender and then sending over that file.
>>
>> I believe that there are other users in the community who also face
>> a similar problem, where the central team managing Spark clusters faces
>> difficulty in helping the end users because they ran their application in
>> shell or yarn client mode (I am not sure what is the equivalent in Mesos).
>>
>> Additionally, there may be teams who want to capture all these logs so
>> that they can be analyzed at some later point in time and the fact that
>> driver logs are not a part of Yarn Logs causes them to capture only partial
>> logs or makes it difficult to capture all the logs.
>>
>> *Proposed Solution:*
>> One "low touch" approach will be to create an ApplicationListener which
>> listens for Application Start and Application End events. On Application
>> Start, this listener will append a Log Appender which writes to a local or
>> remote (eg:hdfs) log file in an application specific directory and moves
>> this to Yarn's Remote Application Dir (or equivalent Mesos Dir) on
>> application end. This way the logs will be available as part of Yarn Logs.
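
(Roughly something like this, I imagine - all class names, config keys and
paths below are made up, and it would be registered via spark.extraListeners:)

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.log4j.{FileAppender, Logger, PatternLayout}
    import org.apache.spark.SparkConf
    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerApplicationStart}

    class DriverLogListener(conf: SparkConf) extends SparkListener {
      private val localLog  = "/tmp/spark-driver.log"
      private val remoteDir = conf.get("spark.driverLog.remoteDir", "/user/spark/driverLogs")

      override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
        // attach a file appender to the root logger, in addition to the console one
        val layout = new PatternLayout("%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n")
        Logger.getRootLogger.addAppender(new FileAppender(layout, localLog))
      }

      override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit = {
        // copy the collected driver log to the remote (e.g. HDFS) log directory
        val fs = FileSystem.get(new Configuration())
        fs.copyFromLocalFile(new Path(localLog), new Path(remoteDir))
      }
    }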
>>
>> I am also interested in hearing about other ideas that the community may
>> have about this. Or if someone has already solved this problem, then I
>> would like them to contribute their solution to the community.
>>
>> Thanks,
>> Ankur
>>
>


Re: sql compile failing with Zinc?

2018-08-14 Thread Marco Gaido
I am not sure, I managed to build successfully using the mvn in the
distribution today.

On Tue, Aug 14, 2018 at 22:02, Sean Owen  wrote:

> If you're running zinc directly, you can give it more memory with -J-Xmx2g
> or whatever. If you're running ./build/mvn and letting it run zinc we might
> need to increase the memory that it requests in the script.
>
> On Tue, Aug 14, 2018 at 2:56 PM Steve Loughran 
> wrote:
>
>> Is anyone else getting the sql module maven build on master branch
>> failing when you use zinc for incremental builds?
>>
>> [warn]   ^
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at
>> scala.tools.nsc.backend.icode.GenICode$Scope.(GenICode.scala:2225)
>> at
>> scala.tools.nsc.backend.icode.GenICode$ICodePhase$Context.enterScope(GenICode.scala:1916)
>> at
>> scala.tools.nsc.backend.icode.GenICode$ICodePhase$Context.enterMethod(GenICode.scala:1901)
>> at
>> scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:118)
>> at
>> scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:71)
>> at
>> scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:148)
>> at
>> scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:98)
>> at
>> scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:71)
>>
>> All is well when zinc is disabled, so I have a workaround - it's just a very
>> slow workaround
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Marco Gaido
-1, due to SPARK-25051. It is a regression and it is a correctness bug. In
2.3.0/2.3.1 an AnalysisException was thrown, while 2.2.* works fine.
I cannot reproduce the issue on current master, but I was able to reproduce
it using the prepared 2.3.2 release.


On Tue, Aug 14, 2018 at 10:04, Saisai Shao  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.3.2.
>
> The vote is open until August 20 PST and passes if a majority +1 PMC votes
> are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.3.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.2-rc5 (commit
> 4dc82259d81102e0cb48f4cb2e8075f80d899ac4):
> https://github.com/apache/spark/tree/v2.3.2-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1281/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc5-docs/
>
> The list of bug fixes going into 2.3.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12343289
>
> Note. RC4 was cancelled because of one blocking issue SPARK-25084 during
> release preparation.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.2?
> ===
>
> The current list of open tickets targeted at 2.3.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.3.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>


Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-10 Thread Marco Gaido
Hi Makatun,

I think your problem has been solved in
https://issues.apache.org/jira/browse/SPARK-16406 which is going to be in
Spark 2.4.
Please try the current master, where you should see that the problem has
disappeared.

Thanks,
Marco

2018-08-09 12:56 GMT+02:00 makatun :

> Here are the images missing in the previous mail. My apologies.
>  timeline.png>
>  readFormat_visualVM_Sampler.jpg>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Marco Gaido
Hi Wenchen,

I think it would be great to also consider:
 - SPARK-24598: Datatype overflow conditions gives incorrect result

It is a correctness bug. What do you think?
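
If I recall the ticket correctly, it is about cases like the following, where
the arithmetic silently wraps around instead of failing or widening (the JIRA
has the exact cases):

    spark.sql("SELECT CAST(2147483647 AS INT) + CAST(1 AS INT)").show()
    // returns -2147483648 today, i.e. an incorrect result rather than an error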

Thanks,
Marco

2018-07-31 4:01 GMT+02:00 Wenchen Fan :

> I went through the open JIRA tickets and here is a list that we should
> consider for Spark 2.4:
>
> *High Priority*:
> SPARK-24374 : Support
> Barrier Execution Mode in Apache Spark
> This one is critical to the Spark ecosystem for deep learning. It only has
> a few remaining works and I think we should have it in Spark 2.4.
>
> *Middle Priority*:
> SPARK-23899 : Built-in
> SQL Function Improvement
> We've already added a lot of built-in functions in this release, but there
> are a few useful higher-order functions in progress, like `array_except`,
> `transform`, etc. It would be great if we can get them in Spark 2.4.
>
> SPARK-14220 : Build
> and test Spark against Scala 2.12
> Very close to finishing, great to have it in Spark 2.4.
>
> SPARK-4502 : Spark SQL
> reads unnecessary nested fields from Parquet
> This one is there for years (thanks for your patience Michael!), and is
> also close to finishing. Great to have it in 2.4.
>
> SPARK-24882 : data
> source v2 API improvement
> This is to improve the data source v2 API based on what we learned during
> this release. From the migration of existing sources and design of new
> features, we found some problems in the API and want to address them. I
> believe this should be the last significant API change to data source
> v2, so great to have in Spark 2.4. I'll send a discuss email about it later.
>
> SPARK-24252 : Add
> catalog support in Data Source V2
> This is a very important feature for data source v2, and is currently
> being discussed in the dev list.
>
> SPARK-24768 : Have a
> built-in AVRO data source implementation
> Most of it is done, but date/timestamp support is still missing. Great to
> have in 2.4.
>
> SPARK-23243 :
> Shuffle+Repartition on an RDD could lead to incorrect answers
> This is a long-standing correctness bug, great to have in 2.4.
>
> There are some other important features like the adaptive execution,
> streaming SQL, etc., not in the list, since I think we are not able to
> finish them before 2.4.
>
> Feel free to add more things if you think they are important to Spark 2.4
> by replying to this email.
>
> Thanks,
> Wenchen
>
> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
>
>> In theory releases happen on a time-based cadence, so it's pretty much
>> wrap up what's ready by the code freeze and ship it. In practice, the
>> cadence slips frequently, and it's very much a negotiation about what
>> features should push the code freeze out a few weeks every time. So, kind
>> of a hybrid approach here that works OK.
>>
>> Certainly speak up if you think there's something that really needs to
>> get into 2.4. This is that discuss thread.
>>
>> (BTW I updated the page you mention just yesterday, to reflect the plan
>> suggested in this thread.)
>>
>> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves 
>> wrote:
>>
>>> Shouldn't this be a discuss thread?
>>>
>>> I'm also happy to see more release managers and agree the time is
>>> getting close, but we should see what features are in progress and see how
>>> close things are and propose a date based on that.  Cutting a branch to
>>> soon just creates more work for committers to push to more branches.
>>>
>>>  http://spark.apache.org/versioning-policy.html mentioned the code
>>> freeze and release branch cut mid-august.
>>>
>>>
>>> Tom
>>>
>>>


Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Marco Gaido
Hi all,

I also like this idea very much, and I think it may also bring other
performance improvements in the future.

Thanks to everybody who worked on this.

I agree to target this feature for 3.0.

Thanks everybody,
Bests.
Marco

On Tue, 31 Jul 2018, 08:39 Wenchen Fan,  wrote:

> Hi Carson and Yuanjian,
>
> Thanks for contributing to this project and sharing the production use
> cases! I believe the adaptive execution will be a very important feature of
> Spark SQL and will definitely benefit a lot of users.
>
> I went through the design docs and the high-level design totally makes
> sense to me. Since the code freeze of Spark 2.4 is close, I'm afraid we may
> not have enough time to review the code and merge it, how about we target
> this feature to Spark 3.0?
>
> Besides, it would be great if we can have some real benchmark numbers for
> it.
>
> Thanks,
> Wenchen
>
> On Tue, Jul 31, 2018 at 2:26 PM Yuanjian Li 
> wrote:
>
>> Thanks Carson, great note!
>> Actually Baidu has ported this patch in our internal folk. I collected
>> some user cases and performance improve effect during Baidu internal usage
>> of this patch, summarize as following 3 scenario:
>> 1. SortMergeJoin to BroadcastJoin
>> The SortMergeJoin transform to BroadcastJoin over deeply tree node can
>> bring us 50% to 200% boosting on query performance, and this strategy alway
>> hit the BI scenario like join several tables with filter strategy in
>> subquery
>> 2. Long running application or use Spark as a service
>> In this case, long running application refers to the duration of
>> application near 1 hour. Using Spark as a service refers to use spark-shell
>> and keep submit sql or use the service of Spark like Zeppelin, Livy or our
>> internal sql service Baidu BigSQL. In such scenario, all spark jobs share
>> same partition number, so enable AE and add configs about expected task
>> info including data size, row number, min\max partition number and etc,
>> will bring us 50%-100% boosting on performance improvement.
>> 3. GraphFrame jobs
>> The last scenario is the application use GraphFrame, in this case, user
>> has a 2-dimension graph with 1 billion edges, use the connected
>> componentsalgorithm in GraphFrame. With enabling AE, the duration of app
>> reduce from 58min to 32min, almost 100% boosting on performance improvement.
>>
>> The detailed screenshots and configs are in the PDF attached to the JIRA
>> SPARK-23128.
>>
>> Thanks,
>> Yuanjian Li
>>
>> Wang, Carson  wrote on Sat, Jul 28, 2018 at 12:49 AM:
>>
>>> Dear all,
>>>
>>>
>>>
>>> The initial support for adaptive execution [SPARK-9850] in Spark SQL has
>>> been there since Spark 1.6, but there have been no more updates since then.
>>> One of
>>> the key features in adaptive execution is to determine the number of
>>> reducer automatically at runtime. This is a feature required by many Spark
>>> users especially the infrastructure team in many companies, as there are
>>> thousands of queries running on the cluster where the shuffle partition
>>> number may not be set properly for every query. The same shuffle partition
>>> number also doesn’t work well for all stages in a query because each stage
>>> has different input data size. Other features in adaptive execution include
>>> optimizing join strategy at runtime and handling skewed join automatically,
>>> which have not been implemented in Spark.
>>>
>>>
>>>
>>> In the current implementation, an Exchange coordinator is used to
>>> determine the number of post-shuffle partitions for a stage. However,
>>> exchange coordinator is added when Exchange is being added, so it actually
>>> lacks a global picture of all shuffle dependencies of a post-shuffle
>>> stage.  I.e. for 3 tables’ join in a single stage, the same
>>> ExchangeCoordinator should be used in three Exchanges but currently two
>>> separated ExchangeCoordinator will be added. It also adds additional
>>> Exchanges in some cases. So I think it is time to rethink how to better
>>> support adaptive execution in Spark SQL. I have proposed a new approach in
>>> SPARK-23128 . A
>>> document about the idea is described at here
>>> .
>>> The idea about how to changing a sort merge join to a broadcast hash join
>>> at runtime is also described in a separated doc
>>> .
>>>
>>>
>>>
>>>
>>> The docs have been there for a while, and I also had an implementation
>>> based on Spark 2.3 available at
>>> https://github.com/Intel-bigdata/spark-adaptive. The code is split into
>>> 7 PRs labeled with AE2.3-0x if you look at the pull requests. I asked many
>>> partners to evaluate the patch including Baidu, Alibaba, JD.com, etc and
>>> received very good feedback. Baidu also 

Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-18 Thread Marco Gaido
+1 (non-binding)

On Wed, 18 Jul 2018, 07:43 Takeshi Yamamuro,  wrote:

> +1 (non-binding)
>
> On Wed, Jul 18, 2018 at 2:41 PM John Zhuge  wrote:
>
>> +1 (non-binding)
>>
>> On Tue, Jul 17, 2018 at 8:06 PM Wenchen Fan  wrote:
>>
>>> +1 (binding). I think this is more clear to both users and developers,
>>> compared to the existing one which only supports append/overwrite and
>>> doesn't work with tables in data source(like JDBC table) well.
>>>
>>> On Wed, Jul 18, 2018 at 2:06 AM Ryan Blue  wrote:
>>>
 +1 (not binding)

 On Tue, Jul 17, 2018 at 10:59 AM Ryan Blue  wrote:

> Hi everyone,
>
> From discussion on the proposal doc and the discussion thread, I think
> we have consensus around the plan to standardize logical write operations
> for DataSourceV2. I would like to call a vote on the proposal.
>
> The proposal doc is here: SPIP: Standardize SQL logical plans
> 
> .
>
> This vote is for the plan in that doc. The related SPIP with APIs to
> create/alter/drop tables will be a separate vote.
>
> Please vote in the next 72 hours:
>
> [+1]: Spark should adopt the SPIP
> [-1]: Spark should not adopt the SPIP because . . .
>
> Thanks for voting, everyone!
>
> --
> Ryan Blue
>


 --
 Ryan Blue

 --
 John Zhuge

>>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-17 Thread Marco Gaido
+1 too

On Tue, 17 Jul 2018, 05:38 Hyukjin Kwon,  wrote:

> +1
>
> On Tue, Jul 17, 2018 at 7:34 AM, Sean Owen  wrote:
>
>> Fix is committed to branches back through 2.2.x, where this test was
>> added.
>>
>> There is still some issue; I'm seeing that archive.apache.org is
>> rate-limiting downloads and frequently returning 503 errors.
>>
>> We can help, I guess, by avoiding testing against non-current releases.
>> Right now we should be testing against 2.3.1, 2.2.2, 2.1.3, right? 2.0.x is
>> now effectively EOL right?
>>
>> I can make that quick change too if everyone's amenable, in order to
>> prevent more failures in this test from master.
>>
>> On Sun, Jul 15, 2018 at 3:51 PM Sean Owen  wrote:
>>
>>> Yesterday I cleaned out old Spark releases from the mirror system --
>>> we're supposed to only keep the latest release from active branches out on
>>> mirrors. (All releases are available from the Apache archive site.)
>>>
>>> Having done so I realized quickly that the
>>> HiveExternalCatalogVersionsSuite relies on the versions it downloads being
>>> available from mirrors. It has been flaky, as sometimes mirrors are
>>> unreliable. I think now it will not work for any versions except 2.3.1,
>>> 2.2.2, 2.1.3.
>>>
>>> Because we do need to clean those releases out of the mirrors soon
>>> anyway, and because they're flaky sometimes, I propose adding logic to the
>>> test to fall back on downloading from the Apache archive site.
>>>
>>> ... and I'll do that right away to unblock
>>> HiveExternalCatalogVersionsSuite runs. I think it needs to be backported to
>>> other branches as they will still be testing against potentially
>>> non-current Spark releases.
>>>
>>> Sean
>>>
>>


Re: [SPARK][SQL][CORE] Running sql-tests

2018-07-03 Thread Marco Gaido
Hi Daniel,
please check
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
.
You should find all your answers in the comments there.
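For example, if I remember the commands correctly (the header comments in that
file are the authoritative reference), you can run the whole suite, or a single
input file, with something like:

    build/sbt "sql/test-only *SQLQueryTestSuite"
    build/sbt "sql/test-only *SQLQueryTestSuite -- -z group-by.sql"

and regenerate the golden result files under results/ with:

    SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite"

In short, the files under inputs/ are plain SQL statements to run, and the files
under results/ are the expected ("golden") outputs the suite compares against.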

Thanks,
Marco

2018-07-03 19:08 GMT+02:00 dmateusp :

> Hey everyone!
>
> Newbie question,
>
> I'm trying to run the tests under
> spark/sql/core/src/test/resources/sql-tests/inputs/ but I got no luck so
> far
>
> How are they called ? What is even the format of those files ? I've never
> seen testing in the format of the inputs/ and results/ what's the name of
> it
> ? where can I read about it ?
>
> Thanks!
>
> Best regards,
> Daniel
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-29 Thread Marco Gaido
Yes, I'd say so.

2018-06-29 4:43 GMT+02:00 吴晓菊 :

> And it should be generic for HashJoin not only broadcast join, right?
>
>
> Chrysan Wu
> 吴晓菊
> Phone:+86 17717640807
>
>
> 2018-06-29 10:42 GMT+08:00 吴晓菊 :
>
>> Sorry for the mistake. You are right that the output ordering of a broadcast
>> join can be the order of the big table in some types of join. I will prepare
>> a PR and let you review it later. Thanks a lot!
>>
>>
>> Chrysan Wu
>> 吴晓菊
>> Phone:+86 17717640807
>>
>>
>> 2018-06-29 0:00 GMT+08:00 Wenchen Fan :
>>
>>> SortMergeJoin sorts its children by join key, but broadcast join does
>>> not. I think the output ordering of broadcast join has nothing to do with
>>> join key.
>>>
>>> On Thu, Jun 28, 2018 at 11:28 PM Marco Gaido 
>>> wrote:
>>>
>>>> I think the outputOrdering would be the one of the big table (if any)
>>>> and it wouldn't matter if this involves the join keys or not. Am I wrong?
>>>>
>>>> 2018-06-28 17:01 GMT+02:00 吴晓菊 :
>>>>
>>>>> Thanks for the reply.
>>>>> By looking into the SortMergeJoinExec, I think we can follow what
>>>>> SortMergeJoin do, for some types of join, if the children is ordered on
>>>>> join keys, we can output the ordered join keys as output ordering.
>>>>>
>>>>>
>>>>> Chrysan Wu
>>>>> 吴晓菊
>>>>> Phone:+86 17717640807
>>>>>
>>>>>
>>>>> 2018-06-28 22:53 GMT+08:00 Wenchen Fan :
>>>>>
>>>>>> SortMergeJoin only reports ordering of the join keys, not the output
>>>>>> ordering of any child.
>>>>>>
>>>>>> It seems reasonable to me that broadcast join should respect the
>>>>>> output ordering of the children. Feel free to submit a PR to fix it, 
>>>>>> thanks!
>>>>>>
>>>>>> On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊  wrote:
>>>>>>
>>>>>>> Why we cannot use the output order of big table?
>>>>>>>
>>>>>>>
>>>>>>> Chrysan Wu
>>>>>>> Phone:+86 17717640807
>>>>>>>
>>>>>>>
>>>>>>> 2018-06-28 21:48 GMT+08:00 Marco Gaido :
>>>>>>>
>>>>>>>> The easy answer to this is that SortMergeJoin ensure an
>>>>>>>> outputOrdering, while BroadcastHashJoin doesn't, ie. after running a
>>>>>>>> BroadcastHashJoin you don't know which is going to be the order of the
>>>>>>>> output since nothing enforces it.
>>>>>>>>
>>>>>>>> Hope this helps.
>>>>>>>> Thanks.
>>>>>>>> Marco
>>>>>>>>
>>>>>>>> 2018-06-28 15:46 GMT+02:00 吴晓菊 :
>>>>>>>>
>>>>>>>>>
>>>>>>>>> We see SortMergeJoinExec is implemented with both
>>>>>>>>> outputPartitioning and outputOrdering, while BroadcastHashJoinExec is
>>>>>>>>> only implemented with outputPartitioning. Why is this design?
>>>>>>>>>
>>>>>>>>> Chrysan Wu
>>>>>>>>> Phone:+86 17717640807
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>


Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Marco Gaido
I think the outputOrdering would be that of the big table (if any), and it
wouldn't matter whether it involves the join keys or not. Am I wrong?

2018-06-28 17:01 GMT+02:00 吴晓菊 :

> Thanks for the reply.
> By looking into SortMergeJoinExec, I think we can follow what
> SortMergeJoin does: for some types of join, if the children are ordered on
> the join keys, we can output the ordered join keys as the output ordering.
>
>
> Chrysan Wu
> 吴晓菊
> Phone:+86 17717640807
>
>
> 2018-06-28 22:53 GMT+08:00 Wenchen Fan :
>
>> SortMergeJoin only reports ordering of the join keys, not the output
>> ordering of any child.
>>
>> It seems reasonable to me that broadcast join should respect the output
>> ordering of the children. Feel free to submit a PR to fix it, thanks!
>>
>> On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊  wrote:
>>
>>> Why we cannot use the output order of big table?
>>>
>>>
>>> Chrysan Wu
>>> Phone:+86 17717640807
>>>
>>>
>>> 2018-06-28 21:48 GMT+08:00 Marco Gaido :
>>>
>>>> The easy answer to this is that SortMergeJoin ensure an outputOrdering,
>>>> while BroadcastHashJoin doesn't, ie. after running a BroadcastHashJoin you
>>>> don't know which is going to be the order of the output since nothing
>>>> enforces it.
>>>>
>>>> Hope this helps.
>>>> Thanks.
>>>> Marco
>>>>
>>>> 2018-06-28 15:46 GMT+02:00 吴晓菊 :
>>>>
>>>>>
>>>>> We see SortMergeJoinExec is implemented with both
>>>>> outputPartitioning and outputOrdering, while BroadcastHashJoinExec is only
>>>>> implemented with outputPartitioning. Why is this design?
>>>>>
>>>>> Chrysan Wu
>>>>> Phone:+86 17717640807
>>>>>
>>>>>
>>>>
>>>
>


Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Marco Gaido
The easy answer to this is that SortMergeJoin ensures an outputOrdering,
while BroadcastHashJoin doesn't, i.e. after running a BroadcastHashJoin you
don't know what the order of the output is going to be, since nothing
enforces it.

Hope this helps.
Thanks.
Marco

2018-06-28 15:46 GMT+02:00 吴晓菊 :

>
> We see SortMergeJoinExec is implemented with both outputPartitioning
> and outputOrdering, while BroadcastHashJoinExec is only implemented with
> outputPartitioning. Why is this design?
>
> Chrysan Wu
> Phone:+86 17717640807
>
>


Re: Time for 2.3.2?

2018-06-28 Thread Marco Gaido
+1 too. I'd also consider including SPARK-24208 if we can solve it in a
timely manner...

2018-06-28 8:28 GMT+02:00 Takeshi Yamamuro :

> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>
> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
> wrote:
>
>> +1
>>
>> On Thu, Jun 28, 2018 at 2:06 PM, Wenchen Fan wrote:
>>
>>> Hi Saisai, that's great! please go ahead!
>>>
>>> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
>>> wrote:
>>>
 +1, like mentioned by Marcelo, these issues seems quite severe.

 I can work on the release if short of hands :).

 Thanks
 Jerry


 On Thu, Jun 28, 2018 at 11:40 AM, Marcelo Vanzin  wrote:

> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
> for those out.
>
> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>
> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
> wrote:
> > Hi all,
> >
> > Spark 2.3.1 was released just a while ago, but unfortunately we
> discovered
> > and fixed some critical issues afterward.
> >
> > SPARK-24495: SortMergeJoin may produce wrong result.
> > This is a serious correctness bug, and is easy to hit: have
> duplicated join
> > key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`,
> and the
> > join is a sort merge join. This bug is only present in Spark 2.3.
> >
> > SPARK-24588: stream-stream join may produce wrong result
> > This is a correctness bug in a new feature of Spark 2.3: the
> stream-stream
> > join. Users can hit this bug if one of the join side is partitioned
> by a
> > subset of the join keys.
> >
> > SPARK-24552: Task attempt numbers are reused when stages are retried
> > This is a long-standing bug in the output committer that may
> introduce data
> > corruption.
> >
> > SPARK-24542: UDFXPath allow users to pass carefully crafted XML
> to
> > access arbitrary files
> > This is a potential security issue if users build access control
> module upon
> > Spark.
> >
> > I think we need a Spark 2.3.2 to address these issues(especially the
> > correctness bugs) ASAP. Any thoughts?
> >
> > Thanks,
> > Wenchen
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Spark issue 20236 - overwrite a partitioned data srouce

2018-06-14 Thread Marco Gaido
Hi Alessandro,


I'd recommend checking the UTs added in the commit which solved the
issue (i.e.
https://github.com/apache/spark/commit/a66fe36cee9363b01ee70e469f1c968f633c5713).
You can use them to try to reproduce the issue.

Thanks,
Marco

2018-06-14 15:57 GMT+02:00 Alessandro Liparoti :

> Good morning,
>
> I am trying to see how this bug affects the write in spark 2.2.0, but I
> cannot reproduce it. Is it ok then using the code
> df.write.mode(SaveMode.Overwrite).insertInto("table_name")
> ?
>
> Thank you,
> *Alessandro Liparoti*
>


Re: Time for 2.1.3

2018-06-13 Thread Marco Gaido
Yes, you're right Herman. Sorry, my bad.

Thanks.
Marco

2018-06-13 14:01 GMT+02:00 Herman van Hövell tot Westerflier <
her...@databricks.com>:

> Isn't this only a problem with Spark 2.3.x?
>
> On Wed, Jun 13, 2018 at 1:57 PM Marco Gaido 
> wrote:
>
>> Hi Marcelo,
>>
>> thanks for bringing this up. Maybe we should consider to include
>> SPARK-24495, as it is causing some queries to return an incorrect result.
>> What do you think?
>>
>> Thanks,
>> Marco
>>
>> 2018-06-13 1:27 GMT+02:00 Marcelo Vanzin :
>>
>>> Hey all,
>>>
>>> There are some fixes that went into 2.1.3 recently that probably
>>> deserve a release. So as usual, please take a look if there's anything
>>> else you'd like on that release, otherwise I'd like to start with the
>>> process by early next week.
>>>
>>> I'll go through jira to see what's the status of things targeted at
>>> that release, but last I checked there wasn't anything on the radar.
>>>
>>> Thanks!
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>


Re: Time for 2.1.3

2018-06-13 Thread Marco Gaido
Hi Marcelo,

thanks for bringing this up. Maybe we should consider including
SPARK-24495, as it is causing some queries to return an incorrect result.
What do you think?

Thanks,
Marco

2018-06-13 1:27 GMT+02:00 Marcelo Vanzin :

> Hey all,
>
> There are some fixes that went into 2.1.3 recently that probably
> deserve a release. So as usual, please take a look if there's anything
> else you'd like on that release, otherwise I'd like to start with the
> process by early next week.
>
> I'll go through jira to see what's the status of things targeted at
> that release, but last I checked there wasn't anything on the radar.
>
> Thanks!
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread Marco Gaido
I'd be against having a new feature in a minor maintenance release. I think
such a release should contain only bugfixes.

2018-05-16 12:11 GMT+02:00 kant kodali :

> Can this https://issues.apache.org/jira/browse/SPARK-23406 be part of
> 2.3.1?
>
> On Tue, May 15, 2018 at 2:07 PM, Marcelo Vanzin 
> wrote:
>
>> Bummer. People should still feel welcome to test the existing RC so we
>> can rule out other issues.
>>
>> On Tue, May 15, 2018 at 2:04 PM, Xiao Li  wrote:
>> > -1
>> >
>> > We have a correctness bug fix that was merged after 2.3 RC1. It would be
>> > nice to have that in Spark 2.3.1 release.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-24259
>> >
>> > Xiao
>> >
>> >
>> > 2018-05-15 14:00 GMT-07:00 Marcelo Vanzin :
>> >>
>> >> Please vote on releasing the following candidate as Apache Spark
>> version
>> >> 2.3.1.
>> >>
>> >> The vote is open until Friday, May 18, at 21:00 UTC and passes if
>> >> a majority of at least 3 +1 PMC votes are cast.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 2.3.1
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>> >>
>> >> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
>> >> https://github.com/apache/spark/tree/v2.3.0-rc1
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
>> >>
>> >> Signatures used for Spark RCs can be found in this file:
>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>
>> >> The staging repository for this release can be found at:
>> >> https://repository.apache.org/content/repositories/orgapache
>> spark-1269/
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
>> >>
>> >> The list of bug fixes going into 2.3.1 can be found at the following
>> URL:
>> >> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>> >>
>> >> FAQ
>> >>
>> >> =
>> >> How can I help test this release?
>> >> =
>> >>
>> >> If you are a Spark user, you can help us test this release by taking
>> >> an existing Spark workload and running on this release candidate, then
>> >> reporting any regressions.
>> >>
>> >> If you're working in PySpark you can set up a virtual env and install
>> >> the current RC and see if anything important breaks, in the Java/Scala
>> >> you can add the staging repository to your projects resolvers and test
>> >> with the RC (make sure to clean up the artifact cache before/after so
>> >> you don't end up building with a out of date RC going forward).
>> >>
>> >> ===
>> >> What should happen to JIRA tickets still targeting 2.3.1?
>> >> ===
>> >>
>> >> The current list of open tickets targeted at 2.3.1 can be found at:
>> >> https://s.apache.org/Q3Uo
>> >>
>> >> Committers should look at those and triage. Extremely important bug
>> >> fixes, documentation, and API tweaks that impact compatibility should
>> >> be worked on immediately. Everything else please retarget to an
>> >> appropriate release.
>> >>
>> >> ==
>> >> But my bug isn't fixed?
>> >> ==
>> >>
>> >> In order to make timely releases, we will typically not hold the
>> >> release unless the bug in question is a regression from the previous
>> >> release. That being said, if there is something which is a regression
>> >> that has not been correctly targeted please ping me or a committer to
>> >> help target the issue.
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: parser error?

2018-05-14 Thread Marco Gaido
Yes Takeshi, I agree. I think we can easily fix the warning by replacing the *
with +, since the two options are not required anyway.
I will test this fix and create a PR when it is ready.
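Concretely, the change I have in mind is just the following tweak to the rule
Takeshi quoted below (a sketch, pending testing):

    fromClause
        : FROM relation (',' relation)* (pivotClause | lateralView+)?
        ;

Since the whole trailing block is already optional, requiring at least one
lateralView inside it does not change the accepted language; it only removes
the alternative that can match an empty string.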

Thanks,
Marco

2018-05-14 15:08 GMT+02:00 Takeshi Yamamuro :

> IIUC, since `lateralView*` can match an empty string inside an optional
> block, antlr shows such a warning:
>
> fromClause
> : FROM relation (',' relation)* (pivotClause | lateralView*)?
> ;
>
> http://www.antlr.org/api/JavaTool/org/antlr/v4/tool/
> ErrorType.html#EPSILON_OPTIONAL
>
>
> On Mon, May 14, 2018 at 9:47 PM, Sean Owen  wrote:
>
>> I don't know anything about it directly, but seems like it would have
>> been caused by https://github.com/apache/spark/commit/e3201e165e41f076ec
>> 72175af246d12c0da529cf
>> The "?" in fromClause is what's generating the warning, and it may be
>> ignorable.
>>
>> On Mon, May 14, 2018 at 12:38 AM Reynold Xin  wrote:
>>
>>> Just saw this in one of my PR that's doc only:
>>>
>>> [error] warning(154): SqlBase.g4:400:0: rule fromClause contains an 
>>> optional block with at least one alternative that can match an empty string
>>>
>>>
>>>
>>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: eager execution and debuggability

2018-05-08 Thread Marco Gaido
I am not sure how useful this is. For students, it is important to
understand how Spark works. This can be critical in many decisions they have
to take (whether and what to cache, for instance) in order to have a
performant Spark application. An eager execution mode can probably help them
get something running more easily, but it also lets them use Spark while
knowing less about how it works, so they are likely to write worse
applications and to have more trouble debugging any problem which may occur
later (in production), which ultimately affects their experience with the tool.

Moreover, as Ryan also mentioned, there are tools/ways to force execution
that help in the debugging phase. So they can achieve the same result
without much effort, but with a big difference: they are aware of what is
really happening, which may help them later.

Thanks,
Marco

2018-05-08 21:37 GMT+02:00 Ryan Blue :

> At Netflix, we use Jupyter notebooks and consoles for interactive
> sessions. For anyone interested, this mode of interaction is really easy to
> add in Jupyter and PySpark. You would just define a different *repr_html*
> or *repr* method for Dataset that runs a take(10) or take(100) and
> formats the result.
>
> That way, the output of a cell or console execution always causes the
> dataframe to run and get displayed for that immediate feedback. But, there
> is no change to Spark’s behavior because the action is run by the REPL, and
> only when a dataframe is a result of an execution in order to display it.
> Intermediate results wouldn’t be run, but that gives users a way to avoid
> too many executions and would still support method chaining in the
> dataframe API (which would be horrible with an aggressive execution model).
>
> There are ways to do this in JVM languages as well if you are using a
> Scala or Java interpreter (see jvm-repr
> ). This is actually what we do in
> our Spark-based SQL interpreter to display results.
>
> rb
> ​
>
> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers  wrote:
>
>> yeah we run into this all the time with new hires. they will send emails
>> explaining there is an error in the .write operation and they are debugging
>> the writing to disk, focusing on that piece of code :)
>>
>> unrelated, but another frequent cause for confusion is cascading errors.
>> like the FetchFailedException. they will be debugging the reducer task not
>> realizing the error happened before that, and the FetchFailedException is
>> not the root cause.
>>
>>
>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin  wrote:
>>
>>> Similar to the thread yesterday about improving ML/DL integration, I'm
>>> sending another email on what I've learned recently from Spark users. I
>>> recently talked to some educators that have been teaching Spark in their
>>> (top-tier) university classes. They are some of the most important users
>>> for adoption because of the multiplicative effect they have on the future
>>> generation.
>>>
>>> To my surprise the single biggest ask they want is to enable eager
>>> execution mode on all operations for teaching and debuggability:
>>>
>>> (1) Most of the students are relatively new to programming, and they
>>> need multiple iterations to even get the most basic operation right. In
>>> these cases, in order to trigger an error, they would need to explicitly
>>> add actions, which is non-intuitive.
>>>
>>> (2) If they don't add explicit actions to every operation and there is a
>>> mistake, the error pops up somewhere later where an action is triggered.
>>> This is in a different position from the code that causes the problem, and
>>> difficult for students to correlate the two.
>>>
>>> I suspect in the real world a lot of Spark users also struggle in
>>> similar ways as these students. While eager execution is really not
>>> practical in big data, in learning environments or in development against
>>> small, sampled datasets it can be pretty helpful.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Transform plan with scope

2018-04-24 Thread Marco Gaido
Hi Joseph, Herman,

thanks for your answers. The specific rule I was looking at
is FoldablePropagation. If you look at it, what it does is first collect an
AttributeMap containing all the possible foldable aliases, and then replace
them in the whole plan (it is a bit more complex than this, I know; this is
just a simplification). So in this scenario, if we have two aliases in
completely separate subtrees, we nonetheless replace both of them, since we
have no idea of the scope in which each of them is available.

I know that this specific problem can be solved in a much easier way: we are
facing a weird situation where there are two aliases with the same exprId,
which is not a good situation in itself (we can just fix it in the analyzer,
as you proposed, Herman). But logically, it would make more sense for a
replacement like this to be applied only where an attribute is in scope (in
other places it should not occur), and I think that if we do things the way
they are logically needed, we are less likely to introduce weird bugs caused
by situations which we never thought possible (or which are not possible
today, but may become so). So I was wondering whether such an operation
could be useful in other places too; if so, we could probably introduce it.
That is the reason for this email thread: understanding whether we need it,
or whether my idea is worthless (in which case, I apologize for wasting your
time).

Yes, Herman, the management of the state is the hardest point. My best idea
so far (though I am not satisfied with it) is to require that the state
passed from each child to the parent extends both Growable and Shrinkable,
and to pass two functions which, for each node, return respectively the
items to discard from the state and the items to add to it. But if we
think/decide that the operation I proposed might be useful, we can probably
spend more time investigating the best solution (any suggestion would be
very welcome).
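Just to make the shape of the idea clearer, here is a toy sketch in plain Scala
(not Catalyst code; all names and types are made up) of a bottom-up transform
where each child passes up the set of attributes its subtree produces, so a rule
only ever sees what is actually in scope for the node it is rewriting:

    // Toy model: Node stands in for a logical plan node, Attr for an Attribute.
    object ScopedTransformSketch {
      case class Attr(name: String)
      case class Node(name: String, produces: Set[Attr], children: Seq[Node])

      // The rule receives, together with the node, only the attributes coming
      // from the node's own subtree (its "scope").
      def transformWithScope(node: Node)(rule: (Node, Set[Attr]) => Node): Node = {
        val newChildren = node.children.map(transformWithScope(_)(rule))
        val scope = newChildren.flatMap(subtreeAttributes).toSet
        rule(node.copy(children = newChildren), scope)
      }

      private def subtreeAttributes(node: Node): Set[Attr] =
        node.produces ++ node.children.flatMap(subtreeAttributes)

      def main(args: Array[String]): Unit = {
        val plan =
          Node("Project0", Set.empty, Seq(
            Node("Union", Set.empty, Seq(
              Node("Project1", Set(Attr("a#1"), Attr("b#1")),
                Seq(Node("ScanTable1", Set(Attr("col1")), Nil))),
              Node("Project2", Set(Attr("a#2"), Attr("b#2")),
                Seq(Node("ScanTable2", Set(Attr("col2")), Nil)))))))
        // Project2's rule invocation only sees col2, not the aliases under Project1.
        transformWithScope(plan) { (node, scope) =>
          println(s"${node.name} sees: ${scope.map(_.name).toSeq.sorted.mkString(", ")}")
          node
        }
      }
    }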

Any more thoughts on this?

Thanks for your answers and your time,
Marco

2018-04-24 19:47 GMT+02:00 Herman van Hövell tot Westerflier <
her...@databricks.com>:

> Hi Marco,
>
> In the case of SPARK-24015 we should perhaps fix this in the analyzer. It
> seems that the plan is somewhat invalid.
>
> If you do want to fix it in the optimizer we might be able to fix it using
> an approach similar to RemoveRedundantAliases (which manually recurses
> down the tree).
>
> As for your proposal we could explore this a little bit. My main question
> would be how would you pass information up the tree (from child to parent),
> and how would you merge such information if there are multiple children? It
> might be kind of painful to generalize.
>
> - Herman
>
> On Tue, Apr 24, 2018 at 7:37 PM, Joseph Torres <
> joseph.tor...@databricks.com> wrote:
>
>> Is there some transformation we'd want to apply to that tree, but can't
>> because we have no concept of scope? It's already possible for a plan rule
>> to traverse each node's subtree if it wants.
>>
>> On Tue, Apr 24, 2018 at 10:18 AM, Marco Gaido <marcogaid...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> working on SPARK-24051 I realized that currently in the Optimizer and in
>>> all the places where we are transforming a query plan, we are lacking the
>>> context information of what is in scope and what is not.
>>>
>>> Coming back to the ticket, the bug reported in the ticket is caused
>>> mainly by two reasons:
>>>  1 - we have two aliases in different places of the plan;
>>>  2 - (the focus of this email) we apply all the rules globally over the
>>> whole plan, without any notion of scope where something is
>>> reachable/visible or not.
>>>
>>> I will start with an easy example to explain what I mean. If we have a
>>> simple query like:
>>>
>>> select a, b from (
>>>   select 1 as a, 2 as b from table1
>>> union
>>>   select 3 as a, 4 as b from table2) q
>>>
>>> We produce a tree which is logically something like:
>>>
>>> Project0(a, b)
>>> -   Union
>>> --Project1 (a, b)
>>> --- ScanTable1
>>> --Project 2(a, b)
>>> --- ScanTable2
>>>
>>> So when we apply a transformation on Project1 for instance, we have no
>>> information about what is coming from ScanTable1 (or in general any node
>>> which is part of the subtree whose root is Project1): we miss a stateful
>>> transform which allows the children to tell the parent, grandparents, and
>>> so on what is in their scope. This is in particular true for the
>>> Attributes: in a node we have no idea if an Attribute comes from its
>>> subtree (it is in scope) or not.
>>>
>>> So, the point of this email is: do you think in general might be useful
>>> to introduce a way of navigating the tree which allows the children to keep
>>> a state to be used by their parents? Or do you think it is useful in
>>> general to introduce the concept of scope (if an attribute can be accessed
>>> by a node of a plan)?
>>>
>>> Thanks,
>>> Marco
>>>
>>>
>>>
>>
>


Transform plan with scope

2018-04-24 Thread Marco Gaido
Hi all,

working on SPARK-24051 I realized that currently, in the Optimizer and in
all the places where we transform a query plan, we lack the context
information of what is in scope and what is not.

Coming back to the ticket, the bug reported there is caused mainly by two
things:
 1 - we have two aliases in different places of the plan;
 2 - (the focus of this email) we apply all the rules globally over the
whole plan, without any notion of the scope in which something is
reachable/visible.

I will start with an easy example to explain what I mean. If we have a
simple query like:

select a, b from (
  select 1 as a, 2 as b from table1
union
  select 3 as a, 4 as b from table2) q

We produce a tree which is logically something like:

Project0(a, b)
-   Union
--Project1 (a, b)
--- ScanTable1
--Project 2(a, b)
--- ScanTable2

So when we apply a transformation on Project1, for instance, we have no
information about what is coming from ScanTable1 (or, in general, from any
node which is part of the subtree whose root is Project1): we lack a
stateful transform which allows the children to tell their parent,
grandparents, and so on what is in their scope. This is particularly true
for Attributes: at a given node we have no idea whether an Attribute comes
from its subtree (i.e. it is in scope) or not.

So, the point of this email is: do you think it might be useful in general
to introduce a way of navigating the tree which allows the children to keep
a state to be used by their parents? Or do you think it is useful in general
to introduce the concept of scope (i.e. whether an attribute can be accessed
by a node of a plan)?

Thanks,
Marco


Re: Block Missing Exception while connecting Spark with HDP

2018-04-24 Thread Marco Gaido
Hi Jasbir,

As a first note, if you are using a vendor distribution, please contact the
vendor for any issue you are facing. This mailing list is for the community,
so we focus on the community edition of Spark.

Anyway, the error seems quite clear: your file on HDFS has a missing block.
This might happen if you lose a datanode, or if the block gets corrupted and
there are no more replicas available for it. The exact root cause is hard to
tell, but in any case you have to investigate what is going on in your HDFS.
Spark has nothing to do with this problem.

Thanks,
Marco

On Tue, 24 Apr 2018, 09:21 Sing, Jasbir,  wrote:

> i am using HDP2.6.3 and 2.6.4 and using the below code –
>
>
>
> 1. Creating sparkContext object
> 2. Reading a text file using – rdd =sc.textFile(“hdfs://
> 192.168.142.129:8020/abc/test1.txt”);
> 3. println(rdd.count);
>
> After executing the 3rd line i am getting the below error –
>
> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain
> block: BP-32082187-172.17.0.2-1517480669419:blk_1073742897_2103
> file=/abc/test1.txt
> at
> org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
> at
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:526)
> at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:749)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
> at java.io.DataInputStream.read(Unknown Source)
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:206)
> at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:45)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:246)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1595)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1143)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1143)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
>
> Can you please help me out in this.
>
>
>
> Regards,
>
> Jasbir Singh
>
>
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy. Your privacy is important to us.
> Accenture uses your personal data only in compliance with data protection
> laws. For further information on how Accenture processes your personal
> data, please see our privacy statement at
> https://www.accenture.com/us-en/privacy-policy.
>
> __
>
> www.accenture.com
>


Re: Accessing Hive Tables in Spark

2018-04-10 Thread Marco Gaido
Hi Tushar,

It seems Spark is not able to access the metastore. It may be because you
are using a Derby metastore, which is maintained locally. Please check all
your configurations and make sure that Spark has access to the hive-site.xml
file with the metastore URI.
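For reference, this is the kind of setup I mean (standard API; the app name is
just an example). hive-site.xml has to be visible to the driver, e.g. under
$SPARK_HOME/conf, or shipped explicitly with --files when you submit in cluster
mode:

    import org.apache.spark.sql.SparkSession

    // enableHiveSupport() only helps if hive-site.xml (with hive.metastore.uris)
    // is available to the driver that creates the session.
    val spark = SparkSession.builder()
      .appName("hive-tables-example")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW TABLES").show()

If the file is only present on the machine you launch from, client mode works
(the driver runs there), while in cluster mode the driver may fall back to a
local metastore and report that the table is not found, which would match what
you are seeing.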

Thanks,
Marco

On Tue, 10 Apr 2018, 08:20 Tushar Singhal,  wrote:

> Hi Everyone,
>
> I was accessing Hive Tables in Spark SQL using Scala submitted by
> spark-submit command.
> When I ran in cluster mode then got error like : Table not found
> But the same is working while submitted as client mode.
> Please help me to understand why?
>
> Distribution : Hortonworks
>
> Thanks in advance !!
>
>


Re: time for Apache Spark 3.0?

2018-04-05 Thread Marco Gaido
Hi all,

I also agree with Mark that we should add Java 9/10 support in an eventual
Spark 3.0 release. Supporting Java 9 is not a trivial task, since we use
some internal APIs for memory management which have changed: either we find
a solution which works on both (but I am not sure that is feasible), or we
have to switch between two implementations according to the Java version.
So I'd rather avoid doing this in a non-major release.

Thanks,
Marco


2018-04-05 17:35 GMT+02:00 Mark Hamstra :

> As with Sean, I'm not sure that this will require a new major version, but
> we should also be looking at Java 9 & 10 support -- particularly with
> regard to their better functionality in a containerized environment (memory
> limits from cgroups, not sysconf; support for cpusets). In that regard, we
> should also be looking at using the latest Scala 2.11.x maintenance release
> in current Spark branches.
>
> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
>
>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
>>
>>> The primary motivating factor IMO for a major version bump is to support
>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>>> Similar to Spark 2.0, I think there are also opportunities for other
>>> changes that we know have been biting us for a long time but can’t be
>>> changed in feature releases (to be clear, I’m actually not sure they are
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>
>>
>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>> simultaneously. The cross-build already works now in 2.3.0. Barring some
>> big change needed to get 2.12 fully working -- and that may be the case --
>> it nearly works that way now.
>>
>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>> in byte code. However Scala itself isn't mutually compatible between 2.11
>> and 2.12 anyway; that's never been promised as compatible.
>>
>> (Interesting question about what *Java* users should expect; they would
>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>
>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>> 2.11 support if needed to make supporting 2.12 less painful.
>>
>
>


Re: 回复: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Marco Gaido
Congrats Zhenhua!

2018-04-02 11:00 GMT+02:00 Saisai Shao :

> Congrats, Zhenhua!
>
> 2018-04-02 16:57 GMT+08:00 Takeshi Yamamuro :
>
>> Congrats, Zhenhua!
>>
>> On Mon, Apr 2, 2018 at 4:13 PM, Ted Yu  wrote:
>>
>>> Congratulations, Zhenhua
>>>
>>>  Original message 
>>> From: 雨中漫步 <601450...@qq.com>
>>> Date: 4/1/18 11:30 PM (GMT-08:00)
>>> To: Yuanjian Li , Wenchen Fan <
>>> cloud0...@gmail.com>
>>> Cc: dev 
>>> Subject: 回复: Welcome Zhenhua Wang as a Spark committer
>>>
>>> Congratulations Zhenhua Wang
>>>
>>>
>>> -- Original message --
>>> *From:* "Yuanjian Li";
>>> *Sent:* Monday, April 2, 2018, 2:26 PM
>>> *To:* "Wenchen Fan";
>>> *Cc:* "Spark dev list";
>>> *Subject:* Re: Welcome Zhenhua Wang as a Spark committer
>>>
>>> Congratulations Zhenhua!!
>>>
>>> 2018-04-02 13:28 GMT+08:00 Wenchen Fan :
>>>
 Hi all,

 The Spark PMC recently added Zhenhua Wang as a committer on the
 project. Zhenhua is the major contributor of the CBO project, and has been
 contributing across several areas of Spark for a while, focusing especially
 on analyzer, optimizer in Spark SQL. Please join me in welcoming Zhenhua!

 Wenchen

>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


Re: Contributing to Spark

2018-03-12 Thread Marco Gaido
Hi Roman,

welcome to the community. Actually, this is not how it works: if you want
to contribute to Spark, you can just look for open JIRAs and submit a PR for
one of them. JIRAs are assigned by committers once the PR gets merged.
If you want, you can also comment on the JIRA to say that you are working on
it.

Looking forward to your contributions.
Best regards,
Marco

2018-03-12 10:48 GMT+01:00 Roman Maier :

> Hello everyone,
>
> I would like to contribute to Spark.
>
> Can somebody give me the possibility to assign issues in jira?
>
>
>
>
>
> Sincerely,
>
> Roman Maier
>


Re: Welcoming some new committers

2018-03-03 Thread Marco Gaido
Congratulations to you all!

On 3 Mar 2018 8:30 a.m., "Liang-Chi Hsieh"  wrote:

>
> Congrats to everyone!
>
>
> Kazuaki Ishizaki wrote
> > Congratulations to everyone!
> >
> > Kazuaki Ishizaki
> >
> >
> >
> > From:   Takeshi Yamamuro 
>
> > linguin.m.s@
>
> > 
> > To: Spark dev list 
>
> > dev@.apache
>
> > 
> > Date:   2018/03/03 10:45
> > Subject:Re: Welcoming some new committers
> >
> >
> >
> > Congrats, all!
> >
> > On Sat, Mar 3, 2018 at 10:34 AM, Takuya UESHIN 
>
> > ueshin@
>
> > 
> > wrote:
> > Congratulations and welcome!
> >
> > On Sat, Mar 3, 2018 at 10:21 AM, Xingbo Jiang 
>
> > jiangxb1987@
>
> > 
> > wrote:
> > Congratulations to everyone!
> >
> > 2018-03-03 8:51 GMT+08:00 Ilan Filonenko 
>
> > if56@
>
> > :
> > Congrats to everyone! :)
> >
> > On Fri, Mar 2, 2018 at 7:34 PM Felix Cheung 
>
> > felixcheung_m@
>
> > 
> > wrote:
> > Congrats and welcome!
> >
> >
> > From: Dongjoon Hyun 
>
> > dongjoon.hyun@
>
> > 
> > Sent: Friday, March 2, 2018 4:27:10 PM
> > To: Spark dev list
> > Subject: Re: Welcoming some new committers
> >
> > Congrats to all!
> >
> > Bests,
> > Dongjoon.
> >
> > On Fri, Mar 2, 2018 at 4:13 PM, Wenchen Fan 
>
> > cloud0fan@
>
> >  wrote:
> > Congratulations to everyone and welcome!
> >
> > On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger 
>
> > cody@
>
> >  wrote:
> > Congrats to the new committers, and I appreciate the vote of confidence.
> >
> > On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia 
>
> > matei.zaharia@
>
> > 
> > wrote:
> >> Hi everyone,
> >>
> >> The Spark PMC has recently voted to add several new committers to the
> > project, based on their contributions to Spark 2.3 and other past work:
> >>
> >> - Anirudh Ramanathan (contributor to Kubernetes support)
> >> - Bryan Cutler (contributor to PySpark and Arrow support)
> >> - Cody Koeninger (contributor to streaming and Kafka support)
> >> - Erik Erlandson (contributor to Kubernetes support)
> >> - Matt Cheah (contributor to Kubernetes support and other parts of
> > Spark)
> >> - Seth Hendrickson (contributor to MLlib and PySpark)
> >>
> >> Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as
> > committers!
> >>
> >> Matei
> >> -
> >> To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
> >>
> >
> > -
> > To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
> >
> >
> >
> >
> >
> >
> >
> > --
> > Takuya UESHIN
> > Tokyo, Japan
> >
> > http://twitter.com/ueshin
> >
> >
> >
> > --
> > ---
> > Takeshi Yamamuro
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-20 Thread Marco Gaido
+1

2018-02-20 12:30 GMT+01:00 Hyukjin Kwon :

> +1 too
>
> 2018-02-20 14:41 GMT+09:00 Takuya UESHIN :
>
>> +1
>>
>>
>> On Tue, Feb 20, 2018 at 2:14 PM, Xingbo Jiang 
>> wrote:
>>
>>> +1
>>>
>>>
>>> On Tue, Feb 20, 2018 at 1:09 PM, Wenchen Fan wrote:
>>>
 +1

 On Tue, Feb 20, 2018 at 12:53 PM, Reynold Xin 
 wrote:

> +1
>
> On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal ,
> wrote:
>
> this file shouldn't be included? https://dist.apache.org/repos/
>> dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
>>
>
> I've now deleted this file
>
> *From:* Sameer Agarwal 
>> *Sent:* Saturday, February 17, 2018 1:43:39 PM
>> *To:* Sameer Agarwal
>> *Cc:* dev
>> *Subject:* Re: [VOTE] Spark 2.3.0 (RC4)
>>
>> I'll start with a +1 once again.
>>
>> All blockers reported against RC3 have been resolved and the builds
>> are healthy.
>>
>> On 17 February 2018 at 13:41, Sameer Agarwal 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.3.0. The vote is open until Thursday February 22, 2018 at 
>>> 8:00:00
>>> am UTC and passes if a majority of at least 3 PMC +1 votes are cast.
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see
>>> https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.0-rc4: https://github.com/apache/spar
>>> k/tree/v2.3.0-rc4 (44095cb65500739695b0324c177c19dfa1471472)
>>>
>>> List of JIRA tickets resolved in this release can be found here:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapache
>>> spark-1265/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs
>>> /_site/index.html
>>>
>>>
>>> FAQ
>>>
>>> ===
>>> What are the unresolved issues targeted for 2.3.0?
>>> ===
>>>
>>> Please see https://s.apache.org/oXKi. At the time of writing, there
>>> are currently no known release blockers.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and
>>> install the current RC and see if anything important breaks, in the
>>> Java/Scala you can add the staging repository to your projects resolvers
>>> and test with the RC (make sure to clean up the artifact cache 
>>> before/after
>>> so you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.0?
>>> ===
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.1 or 
>>> 2.4.0 as
>>> appropriate.
>>>
>>> ===
>>> Why is my bug not fixed?
>>> ===
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from 2.2.0. That 
>>> being
>>> said, if there is something which is a regression from 2.2.0 and has not
>>> been correctly targeted please ping me or a committer to help target the
>>> issue (you can see the open issues listed as impacting Spark 2.3.0 at
>>> https://s.apache.org/WmoI).
>>>
>>
>>
>>
>> --
>> Sameer Agarwal
>> Computer Science | UC Berkeley
>> http://cs.berkeley.edu/~sameerag
>>
>
>
>
> --
> Sameer Agarwal
> Computer Science | UC Berkeley
> http://cs.berkeley.edu/~sameerag
>
>

>>
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> 

Re: There is no space for new record

2018-02-13 Thread Marco Gaido
You can check all the versions where the fix is available on the JIRA
ticket SPARK-23376. In any case, it will be available in the upcoming 2.3.0 release.

Thanks.

On 13 Feb 2018 9:09 a.m., "SNEHASISH DUTTA" 
wrote:

> Hi,
>
> In which version of Spark will this fix  be available ?
> The deployment is on EMR
>
> Regards,
> Snehasish
>
> On Fri, Feb 9, 2018 at 8:51 PM, Wenchen Fan  wrote:
>
>> It should be fixed by https://github.com/apache/spark/pull/20561 soon.
>>
>> On Fri, Feb 9, 2018 at 6:16 PM, Wenchen Fan  wrote:
>>
>>> This has been reported before: http://apache-spark-de
>>> velopers-list.1001551.n3.nabble.com/java-lang-IllegalStateEx
>>> ception-There-is-no-space-for-new-record-tc20108.html
>>>
>>> I think we may have a real bug here, but we need a reproduce. Can you
>>> provide one? thanks!
>>>
>>> On Fri, Feb 9, 2018 at 5:59 PM, SNEHASISH DUTTA <
>>> info.snehas...@gmail.com> wrote:
>>>
 Hi ,

 I am facing the following when running on EMR

 Caused by: java.lang.IllegalStateException: There is no space for new
 record
 at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemoryS
 orter.insertRecord(UnsafeInMemorySorter.java:226)
 at org.apache.spark.sql.execution.UnsafeKVExternalSorter.
 (UnsafeKVExternalSorter.java:132)
 at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMa
 p.destructAndCreateExternalSorter(UnsafeFixedWidthAggregatio
 nMap.java:250)

 I am using pyspark 2.2 , what spark configuration should be
 changed/modified to get this resolved


 Regards,
 Snehasish


 Regards,
 Snehasish

 On Fri, Feb 9, 2018 at 1:26 PM, SNEHASISH DUTTA <
 info.snehas...@gmail.com> wrote:

> Hi ,
>
> I am facing the following when running on EMR
>
> Caused by: java.lang.IllegalStateException: There is no space for new
> record
> at org.apache.spark.util.collecti
> on.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMem
> orySorter.java:226)
> at org.apache.spark.sql.execution
> .UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:132)
> at org.apache.spark.sql.execution
> .UnsafeFixedWidthAggregationMap.destructAndCreateExternalSor
> ter(UnsafeFixedWidthAggregationMap.java:250)
>
> I am using spark 2.2 , what spark configuration should be
> changed/modified to get this resolved
>
>
> Regards,
> Snehasish
>


>>>
>>
>


BroadcastHashJoinExec cleanup

2018-01-29 Thread Marco Gaido
Hello,

looking at BroadcastHashJoinExec, it seems to me that it never destroys the
broadcast variables it uses, and I think this can cause problems like SPARK-22575.

Anyway, when I tried to add a "cleanup" step to destroy the variable, I saw
some test failures because something was trying to access the already
destroyed broadcast variable.

I think the reason for this lies in BroadcastExchangeExec, where the same
broadcast relation can be returned if there are two or more
invocations.

So my questions are: first of all, am I right, or am I missing something?
And if I am right, in which cases can a BroadcastExchangeExec be used more
than once (I can't think of any)?

Thanks,
Marco


Re: Failing Spark Unit Tests

2018-01-23 Thread Marco Gaido
I tried making a change for it, but I was unable to reproduce the failure.
Anyway, I am seeing some unrelated errors in other PRs too, so there might
be (or might have been) something wrong at some point. But I'd expect the
test to pass locally anyway.

2018-01-23 15:23 GMT+01:00 Sean Owen :

> That's odd. The current master build is failing for unrelated reasons
> (Jenkins jobs keep getting killed) so it's possible a very recent change
> did break something, though they would have had to pass tests in the PR
> builder first. You can go ahead and open a PR for your change and see what
> the PR builder tests say.
>
> On Tue, Jan 23, 2018 at 4:42 AM Yacine Mazari  wrote:
>
>> Hi All,
>>
>> I am currently working on  SPARK-23166
>>   , but after running
>> "./dev/run-tests", the Python unit tests (supposedly unrelated to my
>> change)
>> are failing for the following reason:
>>
>> 
>> ===
>> File "/home/yacine/spark/python/pyspark/ml/linalg/__init__.py", line
>> 895, in
>> __main__.DenseMatrix.__str__
>> Failed example:
>> print(dm)
>> Expected:
>> DenseMatrix([[ 0.,  2.],
>>  [ 1.,  3.]])
>> Got:
>> DenseMatrix([[0., 2.],
>>  [1., 3.]])
>> 
>> ===
>>
>> Notice that the missing space in the output is causing the failure.
>>
>> Any hints what is causing this? Are there any specific version of Python
>> and/or other libraries I should be using?
>>
>> Thanks.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Join Strategies

2018-01-13 Thread Marco Gaido
Hi dev,

I have a question about how join strategies are chosen.

I see that CartesianProductExec is used only for inner joins, while for
other kinds of joins BroadcastNestedLoopJoinExec is used.
For reference:
https://github.com/apache/spark/blob/cd9f49a2aed3799964976ead06080a0f7044a0c3/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L260

Could you kindly explain why this is done? It doesn't seem a great choice
to me, since BroadcastNestedLoopJoinExec can fail with OOM.
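For example (illustrative only; whether the broadcast variant is picked also
depends on the table sizes and on spark.sql.autoBroadcastJoinThreshold), a
non-equi outer join has no hash keys, so it cannot use CartesianProductExec,
which handles only inner joins, and ends up as a nested loop join:

    val left = spark.range(1000).toDF("a")
    val right = spark.range(1000).toDF("b")
    left.join(right, left("a") > right("b"), "left_outer").explain()
    // the physical plan shows BroadcastNestedLoopJoinExec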

Thanks,
Marco


R: Decimals

2017-12-21 Thread Marco Gaido
Thanks for your answer, Xiao. The point is that behaving like this goes against 
the SQL standard and also differs from Hive's behavior. So I would propose adding 
a configuration flag to switch between the two behaviors: either being SQL and 
Hive compliant, or behaving as we do now (as Herman was suggesting in the PR). 
Do we agree on this approach? If so, is there any way to read a configuration 
property in the catalyst project?
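To make the proposal concrete, the flag itself could be declared along these
lines (a sketch only: the name, wording and default are all up for discussion,
and it assumes we find an agreed way for the catalyst rules to actually read it):

    // Sketch of a possible flag definition in org.apache.spark.sql.internal.SQLConf
    // (illustrative, not merged code).
    val DECIMAL_OPERATIONS_ALLOW_PRECISION_LOSS =
      buildConf("spark.sql.decimalOperations.allowPrecisionLoss")
        .doc("When true, decimal arithmetic returns a rounded value when an exact " +
          "result cannot be represented, instead of returning NULL.")
        .booleanConf
        .createWithDefault(true)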

Thank you,
Marco

- Original message -
From: "Xiao Li" <gatorsm...@gmail.com>
Sent: 21/12/2017 22:46
To: "Marco Gaido" <marcogaid...@gmail.com>
Cc: "Reynold Xin" <r...@databricks.com>; "dev@spark.apache.org" 
<dev@spark.apache.org>
Subject: Re: Decimals

Losing precision is not acceptable to financial customers. Thus, instead of 
returning NULL, I saw DB2 issues the following error message:


SQL0802N  Arithmetic overflow or other arithmetic exception occurred.  

SQLSTATE=22003


DB2 on z/OS has been used by most of the biggest banks and financial institutions 
since the 1980s. Either issuing exceptions (what DB2 does) or returning NULL (what 
we are doing) looks fine to me. If they want to avoid getting NULL or 
exceptions, users should manually add the round functions themselves. 


Also see the technote of DB2 zOS: 
http://www-01.ibm.com/support/docview.wss?uid=swg21161024












2017-12-19 8:41 GMT-08:00 Marco Gaido <marcogaid...@gmail.com>:

Hello everybody,


I did some further research and now I am sharing my findings. I am sorry, it 
is going to be quite a long e-mail, but I'd really appreciate some feedback 
when you have time to read it.


Spark's current implementation of arithmetic operations on decimals was 
"copied" from Hive. Thus, the initial goal of the implementation was to be 
compliant with Hive, which itself aims to reproduce SQLServer behavior. 
Therefore I compared these 3 DBs and of course I checked the SQL ANSI standard 
2011 (you can find it at 
http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip)
 and a late draft of the standard 2003 
(http://www.wiscorp.com/sql_2003_standard.zip). The main topics are 3:
1. how to determine the precision and scale of a result;
2. how to behave when the result is a number which is not representable exactly 
with the result's precision and scale (i.e. it requires precision loss);
3. how to behave when the result is out of the range of the representable values 
with the result's precision and scale (i.e. it is bigger than the biggest 
representable number or smaller than the smallest one).
Currently, Spark behaves as follows:
1. it follows some rules taken from the initial Hive implementation;
2. it returns NULL;
3. it returns NULL.


The SQL ANSI is pretty clear about points 2 and 3, while it says barely nothing 
about point 1, I am citing SQL ANSI:2011 page 27:


If the result cannot be represented exactly in the result type, then whether it 
is rounded
or truncated is implementation-defined. An exception condition is raised if the 
result is
outside the range of numeric values of the result type, or if the arithmetic 
operation
is not defined for the operands.


Then, as you can see, Spark is not respecting the SQL standard for either 
point 2 or point 3. Someone might then argue that we need compatibility with Hive, 
so let's take a look at it. Since Hive 2.2.0 (HIVE-15331), Hive's behavior is:
1. the rules are a bit changed, to reflect the SQLServer implementation as described 
in this blog post 
(https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/);
2. it rounds the result;
3. it returns NULL (HIVE-18291 is open to make this compliant with the SQL ANSI 
standard and throw an Exception).
As far as the other DBs are concerned, there is little to say about Oracle and 
Postgres, since they have nearly infinite precision, so it is also hard to 
test their behavior in these conditions, but SQLServer has the same precision as 
Hive and Spark. This is SQLServer's behavior:
1. the rules should be the same as Hive's, as described in their post (tests of the 
behavior confirm this);
2. it rounds the result;
3. it throws an Exception.
Therefore, since I think that Spark should be compliant with the SQL ANSI standard 
(first) and Hive, I propose the following changes:
1. update the rules used to derive the result type so that they reflect the new 
Hive rules (which are SQLServer's);
2. change Spark's behavior to round the result, as done by Hive and SQLServer and 
prescribed by the SQL standard;
3. change Spark's behavior by introducing a configuration parameter to 
determine whether to return null or throw an Exception (by default I propose to 
throw an exception in order to be compliant with the SQL standard, which IMHO 
is more important than being compliant with Hive).
For 1 and 2, I prepared a PR, which is 
https://github.com/apache/spark/pull/20023. For 3, I'd love to get your 
feedbacks in order to agree on what to do and then I wi

Re: Decimals

2017-12-19 Thread Marco Gaido
Hello everybody,

I did some further research and now I am sharing my findings. I am sorry,
it is going to be quite a long e-mail, but I'd really appreciate some
feedback when you have time to read it.

Spark's current implementation of arithmetic operations on decimals was
"copied" from Hive. Thus, the initial goal of the implementation was to be
compliant with Hive, which itself aims to reproduce SQLServer behavior.
Therefore I compared these 3 DBs and of course I checked the SQL ANSI
standard 2011 (you can find it at
http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip)
and a late draft of the standard 2003 (
http://www.wiscorp.com/sql_2003_standard.zip). The main topics are 3:

   1. how to determine the precision and scale of a result;
   2. how to behave when the result is a number which is not representable
   exactly with the result's precision and scale (i.e. it requires precision loss);
   3. how to behave when the result is out of the range of the
   representable values with the result's precision and scale (i.e. it is
   bigger than the biggest representable number or smaller than the smallest one).

Currently, Spark behaves as follows:

   1. It follows some rules taken from the initial Hive implementation;
   2. it returns NULL;
   3. it returns NULL.


The SQL ANSI standard is pretty clear about points 2 and 3, while it says
almost nothing about point 1. I am citing SQL ANSI:2011, page 27:

If the result cannot be represented exactly in the result type, then
> whether it is rounded
> or truncated is implementation-defined. An exception condition is raised
> if the result is
> outside the range of numeric values of the result type, or if the
> arithmetic operation
> is not defined for the operands.


Then, as you can see, Spark is not respecting the SQL standard for either
point 2 or point 3. Someone might then argue that we need compatibility with
Hive, so let's take a look at it. Since Hive 2.2.0 (HIVE-15331), Hive's
behavior is:

   1. Rules are a bit changed, to reflect SQLServer implementation as
   described in this blog (
   
https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/
   );
   2. It rounds the result;
   3. It returns NULL (HIVE-18291 is open to be compliant with SQL ANSI
   standard and throw an Exception).

As far as the other DBs are concerned, there is little to say about Oracle
and Postgres, since they have nearly infinite precision, so it is also hard
to test their behavior in these conditions, but SQLServer has the same
precision as Hive and Spark. This is SQLServer's behavior:

   1. The rules should be the same as Hive's, as described in their post
   (tests of the behavior confirm this);
   2. It rounds the result;
   3. It throws an Exception.

Therefore, since I think that Spark should be compliant with SQL ANSI (first)
and Hive, I propose the following changes:

   1. Update the rules used to derive the result type so that they reflect
   the new Hive rules (which are SQLServer's);
   2. Change Spark's behavior to round the result, as done by Hive and
   SQLServer and prescribed by the SQL standard;
   3. Change Spark's behavior by introducing a configuration parameter to
   determine whether to return null or throw an Exception (by default
   I propose to throw an exception in order to be compliant with the SQL
   standard, which IMHO is more important than being compliant with Hive).

For 1 and 2, I have prepared a PR:
https://github.com/apache/spark/pull/20023. For 3, I'd love to get your
feedback in order to agree on what to do, and then I will eventually open a
PR reflecting what the community decides here.
I would really love to get your feedback, either here or on the PR.

Thanks for your patience and your time reading this long email,
Best regards.
Marco


2017-12-13 9:08 GMT+01:00 Reynold Xin <r...@databricks.com>:

> Responses inline
>
> On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido <marcogaid...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> Over the last few weeks I have seen a lot of problems related to decimal
>> values (SPARK-22036, SPARK-22755, for instance). Some are related to
>> historical choices, which I am not aware of, so please excuse me if I am
>> saying dumb things:
>>
>>  - why are we interpreting literal constants in queries as Decimal and
>> not as Double? I think it is very unlikely that a user can enter a number
>> which is beyond Double precision.
>>
>
> Probably just to be consistent with some popular databases.
>
>
>
>>  - why are we returning null in case of precision loss? Is this approach
>> better than just giving a result which might lose some accuracy?
>>
>
> The contract with decimal is that it should never lose precision (it is
> created for financial reports, accounting, etc). Returning null is at least
> telling the user the data type can no longer support the precision required.
>
>
>
>>
>> Thanks,
>> Marco
>>
>
>


Decimals

2017-12-12 Thread Marco Gaido
Hi all,

Over the last few weeks I have seen a lot of problems related to decimal
values (SPARK-22036, SPARK-22755, for instance). Some are related to
historical choices, which I am not aware of, so please excuse me if I am
saying dumb things:

 - why are we interpreting literal constants in queries as Decimal and not
as Double? I think it is very unlikely that a user can enter a number which
is beyond Double precision.
 - why are we returning null in case of precision loss? Is this approach
better than just giving a result which might lose some accuracy?

Thanks,
Marco


Re: Some Spark MLLIB tests failing due to some classes not being registered with Kryo

2017-11-11 Thread Marco Gaido
Hi Jorge,

then try running the tests not from the mllib folder but from the Spark base
directory.
If you want to run only the tests in mllib, you can select just that module
using mvn's -pl argument (see the example below).
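
For instance, assuming the module layout and artifact names have not changed,
something along these lines from the Spark base directory should run only the
mllib module's tests (you may need a full build first so that the dependencies
are available locally):

    build/mvn -pl mllib test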

Thanks,
Marco



2017-11-11 13:37 GMT+01:00 Jorge Sánchez <jorgesg1...@gmail.com>:

> Hi Marco,
>
> Just mvn test from the mllib folder.
>
> Thank you.
>
> 2017-11-11 12:36 GMT+00:00 Marco Gaido <marcogaid...@gmail.com>:
>
>> Hi Jorge,
>>
>> how are you running those tests?
>>
>> Thanks,
>> Marco
>>
>> 2017-11-11 13:21 GMT+01:00 Jorge Sánchez <jorgesg1...@gmail.com>:
>>
>>> Hi Dev,
>>>
>>> I'm running the MLLIB tests in the current Master branch and the
>>> following Suites are failing due to some classes not being registered with
>>> Kryo:
>>>
>>> org.apache.spark.mllib.MatricesSuite
>>> org.apache.spark.mllib.VectorsSuite
>>> org.apache.spark.ml.InstanceSuite
>>>
>>> I can solve it by registering the failing classes with Kryo, but I'm
>>> wondering if I'm missing something as these tests shouldn't be failing from
>>> Master.
>>>
>>> Any suggestions on what I may be doing wrong?
>>>
>>> Thank you.
>>>
>>
>>
>
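
As for registering the classes with Kryo, which Jorge mentions above as a
workaround: the usual way to do that from user code is through SparkConf (a
minimal sketch; the vector classes listed here are placeholders for whatever
the failing suites complain about, and whether the suites enable strict
registration is an assumption on my part):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo-registration-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // spark.kryo.registrationRequired makes serialization fail (rather than
      // fall back) for classes that have not been registered.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        classOf[org.apache.spark.ml.linalg.DenseVector],
        classOf[org.apache.spark.ml.linalg.SparseVector]
      ))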


Re: Timeline for Spark 2.3

2017-11-10 Thread Marco Gaido
I, too, would love to have SPARK-18016. I think it would help a lot of users.

2017-11-10 5:58 GMT+01:00 Nick Pentreath :

> +1 I think that’s practical
>
> On Fri, 10 Nov 2017 at 03:13, Erik Erlandson  wrote:
>
>> +1 on extending the deadline. It will significantly improve the logistics
>> for upstreaming the Kubernetes back-end. I also agree on the general
>> realities of reduced bandwidth over the Nov-Dec holiday season.
>> Erik
>>
>> On Thu, Nov 9, 2017 at 6:03 PM, Matei Zaharia 
>> wrote:
>>
>>> I’m also +1 on extending this to get Kubernetes and other features in.
>>>
>>> Matei
>>>
>>> > On Nov 9, 2017, at 4:04 PM, Anirudh Ramanathan
>>>  wrote:
>>> >
>>> > This would help the community on the Kubernetes effort quite a bit -
>>> giving us additional time for reviews and testing for the 2.3 release.
>>> >
>>> > On Thu, Nov 9, 2017 at 3:56 PM, Justin Miller <
>>> justin.mil...@protectwise.com> wrote:
>>> > That sounds fine to me. I’m hoping that this ticket can make it into
>>> Spark 2.3: https://issues.apache.org/jira/browse/SPARK-18016
>>> >
>>> > It’s causing some pretty considerable problems when we alter the
>>> columns to be nullable, but we are OK for now without that.
>>> >
>>> > Best,
>>> > Justin
>>> >
>>> >> On Nov 9, 2017, at 4:54 PM, Michael Armbrust 
>>> wrote:
>>> >>
>>> >> According to the timeline posted on the website, we are nearing
>>> branch cut for Spark 2.3.  I'd like to propose pushing this out towards mid
>>> to late December for a couple of reasons and would like to hear what people
>>> think.
>>> >>
>>> >> 1. I've done release management during the Thanksgiving / Christmas
>>> time before and in my experience, we don't actually get a lot of testing
>>> during this time due to vacations and other commitments. I think beginning
>>> the RC process in early January would give us the best coverage in the
>>> shortest amount of time.
>>> >> 2. There are several large initiatives in progress that given a
>>> little more time would leave us with a much more exciting 2.3 release.
>>> Specifically, the work on the history server, Kubernetes and continuous
>>> processing.
>>> >> 3. Given the actual release date of Spark 2.2, I think we'll still
>>> get Spark 2.3 out roughly 6 months after.
>>> >>
>>> >> Thoughts?
>>> >>
>>> >> Michael
>>> >
>>> >
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>


[ML] Migrating transformers from mllib to ml

2017-11-06 Thread Marco Gaido
Hello,

I saw that there are several TODOs to migrate some transformers (like
HashingTF and IDF) to use only ml.Vector in order to avoid the overhead of
converting them to the mllib ones and back.
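
The conversions in question use the bridge methods that already exist between
the two vector classes; here is a minimal sketch of the round trip (calling
this an "overhead" is my paraphrase of the TODOs, not text from the code):

    import org.apache.spark.ml.linalg.{Vector => NewVector, Vectors => NewVectors}
    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

    val v: NewVector = NewVectors.dense(1.0, 2.0, 3.0)
    val old = OldVectors.fromML(v)      // ml.linalg -> mllib.linalg
    val back: NewVector = old.asML      // and back again for the output column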

Is there any reason why this has not been done so far? Is it to avoid code
duplication? If so, is it still an issue, given that we are going to deprecate
mllib from 2.3 (at least this is what I read in the Spark docs)? If not, I can
work on this.

Thanks,
Marco