Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-11 Thread Reynold Xin
I filed a ticket: https://issues.apache.org/jira/browse/INFRA-17403

Please add your support there.

On Tue, Dec 11, 2018 at 4:58 PM, Sean Owen <sro...@apache.org> wrote:

> I asked on the original ticket at
> https://issues.apache.org/jira/browse/INFRA-17385 but no follow-up. Go
> ahead and open a new INFRA ticket.
>
> On Tue, Dec 11, 2018 at 6:20 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I
>> want to put some pressure myself there too.
>>
>> On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen <sro...@apache.org> wrote:
>>
>>> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra
>>> noise.
>>>
>>> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin <van...@cloudera.com> wrote:
>>>
>>> Hmm, it also seems that github comments are being sync'ed to jira. That's
>>> gonna get old very quickly, we should probably ask infra to disable that
>>> (if we can't do it ourselves).
>>> On Mon, Dec 10, 2018 at 9:13 AM Sean Owen <sro...@apache.org> wrote:
>>>
>>> Update for committers: now that my user ID is synced, I can successfully
>>> push to remote https://github.com/apache/spark directly. Use that as the
>>> 'apache' remote (if you like; gitbox also works). I confirmed the sync
>>> works both ways.
>>>
>>> As a bonus you can directly close pull requests when needed instead of
>>> using "Close Stale PRs" pull requests.
>>>
>>> On Mon, Dec 10, 2018 at 10:30 AM Sean Owen <sro...@apache.org> wrote:
>>>
>>> Per the thread last week, the Apache Spark repos have migrated from
>>> https://git-wip-us.apache.org/repos/asf to
>>> https://gitbox.apache.org/repos/asf
>>>
>>> Non-committers:
>>>
>>> This just means repointing any references to the old repository to the
>>> new one. It won't affect you if you were already referencing
>>> https://github.com/apache/spark .
>>>
>>> Committers:
>>>
>>> Follow the steps at https://reference.apache.org/committer/github to
>>> fully sync your ASF and Github accounts, and then wait up to an hour for
>>> it to finish.
>>>
>>> Then repoint your git-wip-us remotes to gitbox in your git checkouts. For
>>> our standard setup that works with the merge script, that should be your
>>> 'apache' remote. For example here are my current remotes:
>>>
>>> $ git remote -v
>>> apache https://gitbox.apache.org/repos/asf/spark.git (fetch)
>>> apache https://gitbox.apache.org/repos/asf/spark.git (push)
>>> apache-github git://github.com/apache/spark (fetch)
>>> apache-github git://github.com/apache/spark (push)
>>> origin https://github.com/srowen/spark (fetch)
>>> origin https://github.com/srowen/spark (push)
>>> upstream https://github.com/apache/spark (fetch)
>>> upstream https://github.com/apache/spark (push)
>>>
>>> In theory we also have read/write access to github.com now too, but right
>>> now it hadn't yet worked for me. It may need to sync. This note just makes
>>> sure anyone knows how to keep pushing commits right now to the new ASF
>>> repo.
>>>
>>> Report any problems here!
>>>
>>> Sean
>>>
>>> --
>>> Marcelo

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-11 Thread Sean Owen
I asked on the original ticket at
https://issues.apache.org/jira/browse/INFRA-17385 but no follow-up. Go
ahead and open a new INFRA ticket.

On Tue, Dec 11, 2018 at 6:20 PM Reynold Xin  wrote:

> Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I
> want to put some pressure myself there too.
>
>
> On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen  wrote:
>
>> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra
>> noise.
>>
>> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin 
>> wrote:
>>
>> Hmm, it also seems that github comments are being sync'ed to jira. That's
>> gonna get old very quickly, we should probably ask infra to disable that
>> (if we can't do it ourselves).
>> On Mon, Dec 10, 2018 at 9:13 AM Sean Owen  wrote:
>>
>> Update for committers: now that my user ID is synced, I can successfully
>> push to remote https://github.com/apache/spark directly. Use that as the
>> 'apache' remote (if you like; gitbox also works). I confirmed the sync
>> works both ways.
>>
>> As a bonus you can directly close pull requests when needed instead of
>> using "Close Stale PRs" pull requests.
>>
>> On Mon, Dec 10, 2018 at 10:30 AM Sean Owen  wrote:
>>
>> Per the thread last week, the Apache Spark repos have migrated from
>> https://git-wip-us.apache.org/repos/asf to
>> https://gitbox.apache.org/repos/asf
>>
>> Non-committers:
>>
>> This just means repointing any references to the old repository to the
>> new one. It won't affect you if you were already referencing
>> https://github.com/apache/spark .
>>
>> Committers:
>>
>> Follow the steps at https://reference.apache.org/committer/github to
>> fully sync your ASF and Github accounts, and then wait up to an hour for it
>> to finish.
>>
>> Then repoint your git-wip-us remotes to gitbox in your git checkouts. For
>> our standard setup that works with the merge script, that should be your
>> 'apache' remote. For example here are my current remotes:
>>
>> $ git remote -v
>> apache https://gitbox.apache.org/repos/asf/spark.git (fetch)
>> apache https://gitbox.apache.org/repos/asf/spark.git (push)
>> apache-github git://github.com/apache/spark (fetch)
>> apache-github git://github.com/apache/spark (push)
>> origin https://github.com/srowen/spark (fetch)
>> origin https://github.com/srowen/spark (push)
>> upstream https://github.com/apache/spark (fetch)
>> upstream https://github.com/apache/spark (push)
>>
>> In theory we also have read/write access to github.com now too, but
>> right now it hadn't yet worked for me. It may need to sync. This note just
>> makes sure anyone knows how to keep pushing commits right now to the new
>> ASF repo.
>>
>> Report any problems here!
>>
>> Sean
>>
>>
>> --
>> Marcelo
>>
>>
>
>
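
For committers repointing an existing checkout as described above, the switch
is just a remote URL update (a minimal sketch, assuming the 'apache' remote
name from the listing):

$ git remote set-url apache https://gitbox.apache.org/repos/asf/spark.git
$ git remote -v   # confirm fetch and push now point at gitbox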


Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-11 Thread Hyukjin Kwon
Me too. I want to put some input as well if that can be helpful.

On Wed, 12 Dec 2018, 8:20 am Reynold Xin wrote:

> Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I
> want to put some pressure myself there too.
>
>
> On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen  wrote:
>
>> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra
>> noise.
>>
>> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin 
>> wrote:
>>
>> Hmm, it also seems that github comments are being sync'ed to jira. That's
>> gonna get old very quickly, we should probably ask infra to disable that
>> (if we can't do it ourselves).
>> On Mon, Dec 10, 2018 at 9:13 AM Sean Owen  wrote:
>>
>> Update for committers: now that my user ID is synced, I can successfully
>> push to remote https://github.com/apache/spark directly. Use that as the
>> 'apache' remote (if you like; gitbox also works). I confirmed the sync
>> works both ways.
>>
>> As a bonus you can directly close pull requests when needed instead of
>> using "Close Stale PRs" pull requests.
>>
>> On Mon, Dec 10, 2018 at 10:30 AM Sean Owen  wrote:
>>
>> Per the thread last week, the Apache Spark repos have migrated from
>> https://git-wip-us.apache.org/repos/asf to
>> https://gitbox.apache.org/repos/asf
>>
>> Non-committers:
>>
>> This just means repointing any references to the old repository to the
>> new one. It won't affect you if you were already referencing
>> https://github.com/apache/spark .
>>
>> Committers:
>>
>> Follow the steps at https://reference.apache.org/committer/github to
>> fully sync your ASF and Github accounts, and then wait up to an hour for it
>> to finish.
>>
>> Then repoint your git-wip-us remotes to gitbox in your git checkouts. For
>> our standard setup that works with the merge script, that should be your
>> 'apache' remote. For example here are my current remotes:
>>
>> $ git remote -v
>> apache https://gitbox.apache.org/repos/asf/spark.git (fetch)
>> apache https://gitbox.apache.org/repos/asf/spark.git (push)
>> apache-github git://github.com/apache/spark (fetch)
>> apache-github git://github.com/apache/spark (push)
>> origin https://github.com/srowen/spark (fetch)
>> origin https://github.com/srowen/spark (push)
>> upstream https://github.com/apache/spark (fetch)
>> upstream https://github.com/apache/spark (push)
>>
>> In theory we also have read/write access to github.com now too, but
>> right now it hadn't yet worked for me. It may need to sync. This note just
>> makes sure anyone knows how to keep pushing commits right now to the new
>> ASF repo.
>>
>> Report any problems here!
>>
>> Sean
>>
>>
>> --
>> Marcelo
>>
>>
>
>


Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-11 Thread Reynold Xin
Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I want 
to put some pressure myself there too.

On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen <sro...@apache.org> wrote:

> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra
> noise.
>
> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin <van...@cloudera.com> wrote:
>
>> Hmm, it also seems that github comments are being sync'ed to jira. That's
>> gonna get old very quickly, we should probably ask infra to disable that
>> (if we can't do it ourselves).
>> On Mon, Dec 10, 2018 at 9:13 AM Sean Owen <sro...@apache.org> wrote:
>>
>>> Update for committers: now that my user ID is synced, I can successfully
>>> push to remote https://github.com/apache/spark directly. Use that as the
>>> 'apache' remote (if you like; gitbox also works). I confirmed the sync
>>> works both ways.
>>>
>>> As a bonus you can directly close pull requests when needed instead of
>>> using "Close Stale PRs" pull requests.
>>>
>>> On Mon, Dec 10, 2018 at 10:30 AM Sean Owen <sro...@apache.org> wrote:
>>>
>>> Per the thread last week, the Apache Spark repos have migrated from
>>> https://git-wip-us.apache.org/repos/asf to
>>> https://gitbox.apache.org/repos/asf
>>>
>>> Non-committers:
>>>
>>> This just means repointing any references to the old repository to the
>>> new one. It won't affect you if you were already referencing
>>> https://github.com/apache/spark .
>>>
>>> Committers:
>>>
>>> Follow the steps at https://reference.apache.org/committer/github to
>>> fully sync your ASF and Github accounts, and then wait up to an hour for
>>> it to finish.
>>>
>>> Then repoint your git-wip-us remotes to gitbox in your git checkouts. For
>>> our standard setup that works with the merge script, that should be your
>>> 'apache' remote. For example here are my current remotes:
>>>
>>> $ git remote -v
>>> apache https://gitbox.apache.org/repos/asf/spark.git (fetch)
>>> apache https://gitbox.apache.org/repos/asf/spark.git (push)
>>> apache-github git://github.com/apache/spark (fetch)
>>> apache-github git://github.com/apache/spark (push)
>>> origin https://github.com/srowen/spark (fetch)
>>> origin https://github.com/srowen/spark (push)
>>> upstream https://github.com/apache/spark (fetch)
>>> upstream https://github.com/apache/spark (push)
>>>
>>> In theory we also have read/write access to github.com now too, but right
>>> now it hadn't yet worked for me. It may need to sync. This note just makes
>>> sure anyone knows how to keep pushing commits right now to the new ASF
>>> repo.
>>>
>>> Report any problems here!
>>>
>>> Sean
>>
>> --
>> Marcelo

Re: GitHub sync

2018-12-11 Thread Dongjoon Hyun
Now, it's recovered.

Dongjoon.

On Tue, Dec 11, 2018 at 2:15 PM Dongjoon Hyun 
wrote:

> https://issues.apache.org/jira/browse/INFRA-17401 is filed.
>
> Dongjoon.
>
> On Tue, Dec 11, 2018 at 12:49 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Currently, GitHub `spark:branch-2.4` is out of sync (with two commits).
>>
>>
>> https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.4
>> https://github.com/apache/spark/commits/branch-2.4
>>
>> I did the following already.
>>
>>1. Wait for the next commit.
>>2. Trigger resync at Apache Selfserv site
>>3. Merge and push directly to GitHub `branch-2.4` (thanks to GitBox
>> transition.)
>>
>> However, after syncing correctly with step 3, the new patches are gone.
>> Technically, GitHub `branch-2.4` seems to have been force-pushed by some other
>> entity. After more investigation, I'm going to file an INFRA issue for
>> this. Please note this.
>>
>> Bests,
>> Dongjoon.
>>
>>


Re: GitHub sync

2018-12-11 Thread Dongjoon Hyun
https://issues.apache.org/jira/browse/INFRA-17401 is filed.

Dongjoon.

On Tue, Dec 11, 2018 at 12:49 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Currently, GitHub `spark:branch-2.4` is out of sync (with two commits).
>
>
> https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.4
> https://github.com/apache/spark/commits/branch-2.4
>
> I did the following already.
>
>1. Wait for the next commit.
>2. Trigger resync at Apache Selfserv site
>3. Merge and push directly to GitHub `branch-2.4` (thanks to GitBox
> transition.)
>
> However, after syncing correctly with step 3, the new patches are gone.
> Technically, GitHub `branch-2.4` seems to have been force-pushed by some other
> entity. After more investigation, I'm going to file an INFRA issue for
> this. Please note this.
>
> Bests,
> Dongjoon.
>
>


GitHub sync

2018-12-11 Thread Dongjoon Hyun
Hi, All.

Currently, GitHub `spark:branch-2.4` is out of sync (with two commits).

https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.4
https://github.com/apache/spark/commits/branch-2.4

I did the following already.

   1. Wait for the next commit.
   2. Trigger resync at Apache Selfserv site
   3. Merge and push directly to GitHub `branch-2.4` (thanks to GitBox
transition.)

However, after syncing correctly with step 3, the new patches are gone.
Technically, GitHub `branch-2.4` seems to have been force-pushed by some other
entity. After more investigation, I'm going to file an INFRA issue for
this. Please note this.

Bests,
Dongjoon.


Re: proposal for expanded & consistent timestamp types

2018-12-11 Thread Li Jin
Of course. I added some comments in the doc.

On Tue, Dec 11, 2018 at 12:01 PM Imran Rashid  wrote:

> Hi Li,
>
> thanks for the comments!  I admit I had not thought very much about python
> support; it's a good point.  But I'd actually like to clarify one thing
> about the doc -- though it discusses java types, the point is actually
> about having support for these logical types at the SQL level.  The doc
> uses java names instead of SQL names just because there is so much
> confusion around the SQL names, as they haven't been implemented
> consistently.  Once there is support for the additional logical types, then
> we'd absolutely want to get the same support in python.
>
> It's great to hear there are existing python types we can map each behavior
> to.  Could you add a comment on the doc on each of the types, mentioning
> the equivalent in python?
>
> thanks,
> Imran
>
> On Fri, Dec 7, 2018 at 1:33 PM Li Jin  wrote:
>
>> Imran,
>>
>> Thanks for sharing this. When working on interop between Spark and
>> Pandas/Arrow in the past, we also faced some issues due to the different
>> definitions of timestamp in Spark and Pandas/Arrow, because Spark timestamp
>> has Instant semantics and Pandas/Arrow timestamp has either LocalDateTime
>> or OffsetDateTime semantics. (Detailed discussion is in the PR:
>> https://github.com/apache/spark/pull/18664#issuecomment-316554156.)
>>
>> For one, I am excited to see this effort going, but I would also love to see
>> Python interop included/considered in the picture. I don't think it adds
>> much to what has already been proposed, because Python timestamps are
>> basically LocalDateTime or OffsetDateTime.
>>
>> Li
>>
>>
>>
>> On Thu, Dec 6, 2018 at 11:03 AM Imran Rashid 
>> wrote:
>>
>>> Hi,
>>>
>>> I'd like to discuss the future of timestamp support in Spark, in
>>> particular with respect to handling timezones in different SQL types.  In
>>> a nutshell:
>>>
>>> * There are at least 3 different ways of handling the timestamp type
>>> across timezone changes
>>> * We'd like Spark to clearly distinguish the 3 types (it currently
>>> implements 1 of them), in a way that is backwards compatible, and also
>>> compliant with the SQL standard.
>>> * We'll get agreement across Spark, Hive, and Impala.
>>>
>>> Zoltan Ivanfi (Parquet PMC, also my coworker) has written up a detailed
>>> doc, describing the problem in more detail, the state of various SQL
>>> engines, and how we can get to a better state without breaking any current
>>> use cases.  The proposal is good for Spark by itself.  We're also going to
>>> the Hive & Impala communities with this proposal, as it's better for
>>> everyone if everything is compatible.
>>>
>>> Note that this isn't proposing a specific implementation in Spark as
>>> yet, just a description of the overall problem and our end goal.  We're
>>> going to each community to get agreement on the overall direction.  Then
>>> each community can figure out specifics as they see fit.  (I don't think
>>> there are any technical hurdles with this approach, e.g. to decide whether
>>> this would even be possible in Spark.)
>>>
>>> Here's a link to the doc Zoltan has put together.  It is a bit long, but
>>> it explains how such a seemingly simple concept has become such a mess and
>>> how we can get to a better state.
>>>
>>>
>>> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.dq3b1mwkrfky
>>>
>>> Please review the proposal and let us know your opinions, concerns and
>>> suggestions.
>>>
>>> thanks,
>>> Imran
>>>
>>
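
For readers following along, the three behaviours contrasted above map roughly
onto the java.time classes of the same names (an illustrative Scala snippet
only; the proposal itself is about SQL-level semantics, not these classes):

import java.time.{LocalDateTime, OffsetDateTime, ZoneOffset}

// The same wall-clock reading carried with three different semantics:
val local   = LocalDateTime.of(2018, 12, 11, 9, 0)             // zone-free wall-clock time
val offset  = OffsetDateTime.of(local, ZoneOffset.ofHours(-8)) // wall clock plus a fixed UTC offset
val instant = offset.toInstant                                 // a single point on the UTC timeline

// Re-rendering the instant in another offset changes the reading, not the
// moment; a LocalDateTime has no zone to shift at all.
println(instant.atOffset(ZoneOffset.ofHours(1)))               // 2018-12-11T18:00+01:00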


[Apache Beam] Custom DataSourceV2 instantiation: parameter passing and Encoders

2018-12-11 Thread Etienne Chauchot
Hi Spark guys,

I'm Etienne Chauchot and I'm a committer on the Apache Beam project. 

We have what we call runners: pieces of software that translate pipelines
written with the Beam API into pipelines that use the native execution engine
API. Currently, the Spark runner uses the old RDD / DStream APIs.
I'm writing a new runner that will use Structured Streaming (but not continuous
processing, and no schema support for now).

I am just starting. I'm currently trying to map our sources to yours, targeting
the new DataSourceV2 API. It maps pretty well to Beam sources, but I have a
problem with instantiation of the custom source.
I searched for an answer on Stack Overflow and the user ML with no luck. I guess
it is too specific a question:

When visiting the Beam DAG I have access to Beam objects such as Source and
Reader that I need to map to MicroBatchReader and InputPartitionReader.
As far as I understand, a custom DataSourceV2 is instantiated automatically by
Spark via sparkSession.readStream().format(providerClassName) or similar code.
The problem is that I can only pass options of primitive types + String, so I
cannot pass the Beam Source to the DataSourceV2.
=> Is there a way to do so?
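
For readers who have not seen that path, it looks roughly like this (a Scala
sketch against the Spark 2.4-era API; the class name and option key are
hypothetical, and the runner's actual code is Java):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()

// Spark instantiates the DataSourceV2 provider reflectively from its class
// name; the only configuration that reaches it is a map of string options,
// which is why a live Beam Source object cannot be handed over on this path.
val providerClassName = "org.example.MyBeamDatasetSource"  // hypothetical
val rows = sparkSession.readStream
  .format(providerClassName)
  .option("beamSourceDescription", "...")                  // strings/primitives only
  .load()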


Also, I get a Dataset<Row> as output. Each Row contains an instance of Beam
WindowedValue<T>, where T is the type parameter of the Source. I do a map on
the Dataset to transform it to a Dataset<WindowedValue<T>>. I have a question
related to the Encoder:
=> how to properly create an Encoder for the generic type WindowedValue<T> to
use in the map?
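
One possible direction for the generic-encoder question, sketched in Scala
although the runner itself is Java (illustrative only, and not necessarily what
the Beam runner ends up using): Spark's built-in Encoders.kryo stores each value
as a single opaque binary column, which avoids deriving a schema for the
generic type. WindowedValue is assumed to be Beam's
org.apache.beam.sdk.util.WindowedValue, and the helper below is hypothetical.

import org.apache.beam.sdk.util.WindowedValue
import org.apache.spark.sql.{Dataset, Encoder, Encoders, Row}

// Hypothetical helper: convert each Row into a WindowedValue[T] using a
// caller-supplied function, with a Kryo-backed encoder for the result type.
def toWindowedValues[T](rows: Dataset[Row])(convert: Row => WindowedValue[T]): Dataset[WindowedValue[T]] = {
  implicit val enc: Encoder[WindowedValue[T]] = Encoders.kryo[WindowedValue[T]]
  rows.map(convert)
}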

Here is the code:
https://github.com/apache/beam/tree/spark-runner_structured-streaming

And more specially:

https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/ReadSourceTranslatorBatch.java

https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/io/DatasetSource.java

Thanks,

Etienne








Re: Self join

2018-12-11 Thread Jörn Franke
I don’t know your exact underlying business problem, but maybe a graph
solution, such as Spark GraphX, meets your requirements better. Usually
self-joins are done to address some kind of graph problem (even if you would
not describe it as such), and a graph approach is much more efficient for these
kinds of problems.

> On Dec 11, 2018, at 12:44, Marco Gaido wrote:
> 
> Hi all,
> 
> I'd like to bring to the attention of more people a problem which has been
> there for a long time, i.e., self joins. Currently, we have many troubles with them.
> This has been reported several times to the community and seems to affect 
> many people, but as of now no solution has been accepted for it.
> 
> I created a PR some time ago in order to address the problem 
> (https://github.com/apache/spark/pull/21449), but Wenchen mentioned he tried 
> to fix this problem too but so far no attempt was successful because there is 
> no clear semantic 
> (https://github.com/apache/spark/pull/21449#issuecomment-393554552).
> 
> So I'd like to propose to discuss here which is the best approach for 
> tackling this issue, which I think would be great to fix for 3.0.0, so if we 
> decide to introduce breaking changes in the design, we can do that.
> 
> Thoughts on this?
> 
> Thanks,
> Marco


Re: proposal for expanded & consistent timestamp types

2018-12-11 Thread Imran Rashid
Hi Li,

thanks for the comments!  I admit I had not thought very much about python
support; it's a good point.  But I'd actually like to clarify one thing
about the doc -- though it discusses java types, the point is actually
about having support for these logical types at the SQL level.  The doc
uses java names instead of SQL names just because there is so much
confusion around the SQL names, as they haven't been implemented
consistently.  Once there is support for the additional logical types, then
we'd absolutely want to get the same support in python.

It's great to hear there are existing python types we can map each behavior
to.  Could you add a comment on the doc on each of the types, mentioning
the equivalent in python?

thanks,
Imran

On Fri, Dec 7, 2018 at 1:33 PM Li Jin  wrote:

> Imran,
>
> Thanks for sharing this. When working on interop between Spark and
> Pandas/Arrow in the past, we also faced some issues due to the different
> definitions of timestamp in Spark and Pandas/Arrow, because Spark timestamp
> has Instant semantics and Pandas/Arrow timestamp has either LocalDateTime
> or OffsetDateTime semantics. (Detailed discussion is in the PR:
> https://github.com/apache/spark/pull/18664#issuecomment-316554156.)
>
> For one, I am excited to see this effort going, but I would also love to see
> Python interop included/considered in the picture. I don't think it adds
> much to what has already been proposed, because Python timestamps are
> basically LocalDateTime or OffsetDateTime.
>
> Li
>
>
>
> On Thu, Dec 6, 2018 at 11:03 AM Imran Rashid 
> wrote:
>
>> Hi,
>>
>> I'd like to discuss the future of timestamp support in Spark, in
>> particular with respect to handling timezones in different SQL types.  In
>> a nutshell:
>>
>> * There are at least 3 different ways of handling the timestamp type
>> across timezone changes
>> * We'd like Spark to clearly distinguish the 3 types (it currently
>> implements 1 of them), in a way that is backwards compatible, and also
>> compliant with the SQL standard.
>> * We'll get agreement across Spark, Hive, and Impala.
>>
>> Zoltan Ivanfi (Parquet PMC, also my coworker) has written up a detailed
>> doc, describing the problem in more detail, the state of various SQL
>> engines, and how we can get to a better state without breaking any current
>> use cases.  The proposal is good for Spark by itself.  We're also going to
>> the Hive & Impala communities with this proposal, as it's better for
>> everyone if everything is compatible.
>>
>> Note that this isn't proposing a specific implementation in Spark as yet,
>> just a description of the overall problem and our end goal.  We're going to
>> each community to get agreement on the overall direction.  Then each
>> community can figure out specifics as they see fit.  (I don't think there
>> are any technical hurdles with this approach, e.g. to decide whether this
>> would even be possible in Spark.)
>>
>> Here's a link to the doc Zoltan has put together.  It is a bit long, but
>> it explains how such a seemingly simple concept has become such a mess and
>> how we can get to a better state.
>>
>>
>> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.dq3b1mwkrfky
>>
>> Please review the proposal and let us know your opinions, concerns and
>> suggestions.
>>
>> thanks,
>> Imran
>>
>


Re: Self join

2018-12-11 Thread Ryan Blue
Marco,

Thanks for starting the discussion! I think it would be great to have a
clear description of the problem and a proposed solution. Do you have
anything like that? It would help bring the rest of us up to speed without
reading different pull requests.

Thanks!

rb

On Tue, Dec 11, 2018 at 3:54 AM Marco Gaido  wrote:

> Hi all,
>
> I'd like to bring to the attention of more people a problem which has
> been there for a long time, i.e., self joins. Currently, we have many troubles with
> them. This has been reported several times to the community and seems to
> affect many people, but as of now no solution has been accepted for it.
>
> I created a PR some time ago in order to address the problem (
> https://github.com/apache/spark/pull/21449), but Wenchen mentioned he
> tried to fix this problem too but so far no attempt was successful because
> there is no clear semantics (
> https://github.com/apache/spark/pull/21449#issuecomment-393554552).
>
> So I'd like to propose to discuss here which is the best approach for
> tackling this issue, which I think would be great to fix for 3.0.0, so if
> we decide to introduce breaking changes in the design, we can do that.
>
> Thoughts on this?
>
> Thanks,
> Marco
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Pushdown in DataSourceV2 question

2018-12-11 Thread Ryan Blue
In v2, it is up to the data source to tell Spark that a pushed filter is
satisfied, by returning the pushed filters that Spark should run. You can
indicate that a filter is handled by the source by not returning it for
Spark. You can also show that a filter is used by the source by showing it
in the output for the plan node, which I think is the `description` method
in the latest set of changes.

If you want to check with an external source to see what can be pushed
down, then you can do that any time in your source implementation.
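
To make that contract concrete, here is a minimal Scala sketch against the
Spark 2.4-era DataSourceV2 reader interfaces (the class name and the particular
filters treated as "supported" are invented for illustration; interface names
may differ in later versions):

import java.util.{Collections, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

class ExampleReader(schema: StructType) extends DataSourceReader with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  // Spark hands us the candidate filters; whatever we return is what Spark
  // will still evaluate itself. Filters we keep (and do not return) are
  // treated as fully handled by the source.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (accepted, rejected) = filters.partition {
      case _: IsNotNull | _: GreaterThan => true  // pretend the source handles these
      case _ => false
    }
    pushed = accepted
    rejected
  }

  // Reported back to Spark; this is what shows up as PushedFilters in explain().
  override def pushedFilters(): Array[Filter] = pushed

  override def readSchema(): StructType = schema

  // Actual scanning omitted in this sketch.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Collections.emptyList[InputPartition[InternalRow]]()
}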

On Tue, Dec 11, 2018 at 3:46 AM Noritaka Sekiyama 
wrote:

> Hi,
> Thank you for responding to this thread. I'm really interested in this
> discussion.
>
> My original idea might be the same as what Alessandro said: introducing a
> mechanism by which Spark can communicate with the DataSource and get metadata
> showing whether pushdown is supported or not.
> I'm wondering whether it would be that expensive.
>
>
>
>
> On Mon, Dec 10, 2018 at 20:12 Alessandro Solimando wrote:
>
>> I think you are generally right, but there are so many different
>> scenarios that it might not always be the best option, consider for
>> instance a "fast" network in between a single data source and "Spark", lots
>> of data, an "expensive" (with low selectivity) expression as Wenchen
>> suggested.
>>
>> In such a case it looks to me that you end up "re-scanning" the whole
>> dataset just to make sure the filter has been applied, where having such an
>> info as metadata or via a communication protocol with the data source (if
>> supported) would be cheaper.
>>
>> If there is no support at all for such a mechanism I think it could be
>> worth exploring a bit more the idea. However, supporting such a mechanism
>> would require some developing effort for each datasource to support (e.g.,
>> asking the datasource for the physical plan applied at query time, the
>> ability to parse it to extract relevant info and act on them), as I am not
>> aware of any general interface for exchanging such information.
>>
>>
>>
>> On Sun, 9 Dec 2018 at 15:34, Jörn Franke  wrote:
>>
>>> It is not about lying or not or trust or not. Some or all filters may
>>> not be supported by a data source. Some might only be applied under certain
>>> environmental conditions (eg enough memory etc).
>>>
>>> It is much more expensive to communicate between Spark and a data source
>>> which filters have been applied or not than just checking it as Spark does.
>>> Especially if you have several different data sources at the same time
>>> (joins etc).
>>>
>>> On Dec 9, 2018, at 14:30, Wenchen Fan wrote:
>>>
>>> expressions/functions can be expensive and I do think Spark should trust
>>> data source and not re-apply pushed filters. If data source lies, many
>>> things can go wrong...
>>>
>>> On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke  wrote:
>>>
 Well even if it has to apply it again, if pushdown is activated then it
 will be much less cost for spark to see if the filter has been applied or
 not. Applying the filter is negligible, what it really avoids if the file
 format implements it is IO cost (for reading) as well as cost for
 converting from the file format internal datatype to the one of Spark.
 Those two things are very expensive, but not the filter check. In the end,
 it could be also data source internal reasons not to apply a filter (there
 can be many depending on your scenario, the format etc). Instead of
 “discussing” between Spark and the data source it is much less costly that
 Spark checks that the filters are consistently applied.

 On Dec 9, 2018, at 12:39, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:

 Hello,
 that's an interesting question, but after Frank's reply I am a bit
 puzzled.

 If there is no control over the pushdown status how can Spark guarantee
 the correctness of the final query?

 Consider a filter pushed down to the data source, either Spark has to
 know if it has been applied or not, or it has to re-apply the filter anyway
 (and pay the price for that).

 Is there any other option I am not considering?

 Best regards,
 Alessandro

 On Sat 8 Dec 2018, 12:32 Jörn Franke wrote:

> BTW. Even for json a pushdown can make sense to avoid data unnecessarily
> ending up in Spark (because it would cause unnecessary
> overhead).
> In the datasource v2 api you need to implement a SupportsPushDownFilter
>
> > On Dec 8, 2018, at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm a support engineer, interested in DataSourceV2.
> >
> > Recently I had some pain to troubleshoot to check if pushdown is
> actually applied or not.
> > I noticed that DataFrame's explain() method shows pushdown even for
> JSON.
> > It totally depends on DataSource side, I believe. However, I would

Self join

2018-12-11 Thread Marco Gaido
Hi all,

I'd like to bring to the attention of more people a problem which has
been there for a long time, i.e., self joins. Currently, we have many troubles with
them. This has been reported several times to the community and seems to
affect many people, but as of now no solution has been accepted for it.
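
A minimal illustration of the kind of ambiguity involved (a sketch only; it
assumes an existing SparkSession named `spark`):

import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Both df("id") references resolve to the same attribute, so what is meant as
// an equi-join between the two sides degenerates into a trivially true
// predicate, and column references on the joined result stay ambiguous.
val joined = df.join(df, df("id") === df("id"))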

I created a PR some time ago in order to address the problem (
https://github.com/apache/spark/pull/21449), but Wenchen mentioned he tried
to fix this problem too but so far no attempt was successful because there
is no clear semantics (
https://github.com/apache/spark/pull/21449#issuecomment-393554552).

So I'd like to propose to discuss here which is the best approach for
tackling this issue, which I think would be great to fix for 3.0.0, so if
we decide to introduce breaking changes in the design, we can do that.

Thoughts on this?

Thanks,
Marco


Re: Pushdown in DataSourceV2 question

2018-12-11 Thread Noritaka Sekiyama
Hi,
Thank you for responding to this thread. I'm really interested in this
discussion.

My original idea might be the same as what Alessandro said: introducing a
mechanism by which Spark can communicate with the DataSource and get metadata
showing whether pushdown is supported or not.
I'm wondering whether it would be that expensive.




On Mon, Dec 10, 2018 at 20:12 Alessandro Solimando wrote:

> I think you are generally right, but there are so many different scenarios
> that it might not always be the best option, consider for instance a "fast"
> network in between a single data source and "Spark", lots of data, an
> "expensive" (with low selectivity) expression as Wenchen suggested.
>
> In such a case it looks to me that you end up "re-scanning" the whole
> dataset just to make sure the filter has been applied, where having such an
> info as metadata or via a communication protocol with the data source (if
> supported) would be cheaper.
>
> If there is no support at all for such a mechanism I think it could be
> worth exploring a bit more the idea. However, supporting such a mechanism
> would require some developing effort for each datasource to support (e.g.,
> asking the datasource for the physical plan applied at query time, the
> ability to parse it to extract relevant info and act on them), as I am not
> aware of any general interface for exchanging such information.
>
>
>
> On Sun, 9 Dec 2018 at 15:34, Jörn Franke  wrote:
>
>> It is not about lying or not or trust or not. Some or all filters may not
>> be supported by a data source. Some might only be applied under certain
>> environmental conditions (eg enough memory etc).
>>
>> It is much more expensive to communicate between Spark and a data source
>> which filters have been applied or not than just checking it as Spark does.
>> Especially if you have several different data sources at the same time
>> (joins etc).
>>
>> On Dec 9, 2018, at 14:30, Wenchen Fan wrote:
>>
>> expressions/functions can be expensive and I do think Spark should trust
>> data source and not re-apply pushed filters. If data source lies, many
>> things can go wrong...
>>
>> On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke  wrote:
>>
>>> Well even if it has to apply it again, if pushdown is activated then it
>>> will be much less cost for spark to see if the filter has been applied or
>>> not. Applying the filter is negligible, what it really avoids if the file
>>> format implements it is IO cost (for reading) as well as cost for
>>> converting from the file format internal datatype to the one of Spark.
>>> Those two things are very expensive, but not the filter check. In the end,
>>> it could be also data source internal reasons not to apply a filter (there
>>> can be many depending on your scenario, the format etc). Instead of
>>> “discussing” between Spark and the data source it is much less costly that
>>> Spark checks that the filters are consistently applied.
>>>
>>> On Dec 9, 2018, at 12:39, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:
>>>
>>> Hello,
>>> that's an interesting question, but after Frank's reply I am a bit
>>> puzzled.
>>>
>>> If there is no control over the pushdown status how can Spark guarantee
>>> the correctness of the final query?
>>>
>>> Consider a filter pushed down to the data source, either Spark has to
>>> know if it has been applied or not, or it has to re-apply the filter anyway
>>> (and pay the price for that).
>>>
>>> Is there any other option I am not considering?
>>>
>>> Best regards,
>>> Alessandro
>>>
>>> On Sat 8 Dec 2018, 12:32 Jörn Franke wrote:
>>>
 BTW. Even for json a pushdown can make sense to avoid data unnecessarily
 ending up in Spark (because it would cause unnecessary overhead).
 In the datasource v2 api you need to implement a SupportsPushDownFilter

 > On Dec 8, 2018, at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
 >
 > Hi,
 >
 > I'm a support engineer, interested in DataSourceV2.
 >
 > Recently I had some pain to troubleshoot to check if pushdown is
 actually applied or not.
 > I noticed that DataFrame's explain() method shows pushdown even for
 JSON.
 > It totally depends on DataSource side, I believe. However, I would
 like Spark to have some way to confirm whether specific pushdown is
 actually applied in DataSource or not.
 >
 > # Example
 > val df = spark.read.json("s3://sample_bucket/people.json")
 > df.printSchema()
 > df.filter($"age" > 20).explain()
 >
 > root
 >  |-- age: long (nullable = true)
 >  |-- name: string (nullable = true)
 >
 > == Physical Plan ==
 > *Project [age#47L, name#48]
 > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
 >+- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
 Location: InMemoryFileIndex[s3://sample_bucket/people.json],
 PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)],