Re: [SQL] parse_url does not work for Internationalized domain names?

2018-01-11 Thread StanZhai
This problem was introduced by the change that was designed to improve the
performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc;]', 'QUERY', 'p')

-- returns NULL in Spark 2.1 and later
-- returns ["abc"] in Spark versions before 2.1
```
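To illustrate the likely mechanism (a sketch only, assuming the change in
question swapped java.net.URL for the stricter java.net.URI, as described in
the original message of this thread): URI rejects unencoded characters such as
'[' in the query string, while URL does not validate them at all.

```scala
import java.net.{URI, URISyntaxException, URL}

// Hypothetical illustration only: URI throws on the unencoded '[' in the
// query, which would make a URI-based parse_url return NULL, while the
// URL-based parsing used before Spark 2.1 (and by Hive) accepts it.
val raw = """http://stanzhai.site?p=["abc"""

try {
  new URI(raw)
} catch {
  case e: URISyntaxException => println(s"URI rejected: ${e.getMessage}")
}

println(new URL(raw).getQuery) // prints: p=["abc
```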

I think it's a regression.







Re: Schema Evolution in Apache Spark

2018-01-11 Thread Georg Heiler
Isn't this related to the data format used, i.e. Parquet, Avro, etc., which
already supports schema changes?

Dongjoon Hyun wrote on Fri., Jan. 12, 2018 at 02:30:

> Hi, All.
>
> A data schema can evolve in several ways, and Apache Spark 2.3 already
> supports the following for file-based data sources such as
> CSV/JSON/ORC/Parquet.
>
> 1. Add a column
> 2. Remove a column
> 3. Change a column position
> 4. Change a column type
>
> Can we guarantee users some schema evolution coverage on file-based data
> sources by adding explicit schema evolution test suites? So far, there are
> only some test cases.
>
> For simplicity, I make several assumptions about schema evolution.
>
> 1. Safe evolution without data loss,
>    e.g. from smaller types to larger types like int-to-long, not vice versa.
> 2. The final schema is given by users (or Hive).
> 3. Only simple Spark data types supported by Spark's vectorized execution.
>
> I made a test case PR to receive your opinions for this.
>
> [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
> data sources
> - https://github.com/apache/spark/pull/20208
>
> Could you take a look and give some opinions?
>
> Bests,
> Dongjoon.
>


Accessing the SQL parser

2018-01-11 Thread Abdeali Kothari
I was writing some code to automatically find the list of tables and databases
used in a Spark SQL query. Mainly, I was looking to automatically check the
permissions and owners of all the tables a query will try to access.

I was wondering whether PySpark has some method that would let me directly use
the AST that Spark SQL builds?

Or is there some documentation on how I can generate and understand the AST
in Spark?
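
For reference, here is a minimal Scala sketch (runnable in spark-shell, where
`spark` is predefined) that uses Spark's internal, unstable parser API to pull
table names out of a query. The same objects are reachable from PySpark via
the JVM gateway (`spark._jsparkSession`), but there is no stable public API
for this, so treat it as an illustration rather than a supported interface:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

val query = "SELECT a.x, b.y FROM db1.table_a a JOIN table_b b ON a.id = b.id"

// Parse the SQL text into an unresolved logical plan (the "AST").
val plan = spark.sessionState.sqlParser.parsePlan(query)

// Unresolved table references appear as UnresolvedRelation nodes.
val tables = plan.collect { case r: UnresolvedRelation => r.tableIdentifier.unquotedString }

println(tables) // e.g. List(db1.table_a, table_b)
```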

Regards,
AbdealiJK


[SQL] parse_url does not work for Internationalized domain names?

2018-01-11 Thread yash datta
Hi devs,

I stumbled across an interesting problem with the parse_url function that was
implemented in Spark in
https://issues.apache.org/jira/browse/SPARK-16281

When using internationalized domain names in URLs, like:

val url = "http://правительство.рф"

parse_url returns null, but works fine when using Hive's version of parse_url.

Digging further, I found that the difference is in the following call in Spark:

private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}


while Hive uses java.net.URL:

url = new URL(urlStr)


Sure enough, this simple test demonstrates that URL works but URI does not in
this case:

val url = "http://правительство.рф "

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф


To reproduce the problem on spark-sql:

spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
returns NULL
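
For what it's worth, here is a possible workaround sketch (not necessarily how
this should be fixed in Spark): convert the IDN host to its ASCII/Punycode
form via java.net.IDN before building the URI, since java.net.URI only accepts
ASCII host names. The URL below is a hypothetical example.

```scala
import java.net.{IDN, URI, URL}

val raw = "http://правительство.рф/some/path" // hypothetical example URL
val asUrl = new URL(raw)                      // URL tolerates the Unicode host

// Re-encode the host as Punycode so URI can handle it.
val asciiHost = IDN.toASCII(asUrl.getHost)
val uri = new URI(asUrl.getProtocol, asUrl.getUserInfo, asciiHost,
  asUrl.getPort, asUrl.getPath, asUrl.getQuery, asUrl.getRef)

println(uri.getHost) // prints the xn-- encoded (Punycode) host
```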

Could someone please explain the reason for using URI instead of URL? Does
this problem warrant creating a JIRA ticket?


Best Regards
Yash

-- 
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.


Build timed out for `branch-2.3 (hadoop-2.7)`

2018-01-11 Thread Dongjoon Hyun
Hi, All and Shane.

Can we increase the build timeout for `branch-2.3` during the 2.3 RC period?

There are two known test issues, but the Jenkins jobs on branch-2.3 with
hadoop-2.7 fail with a build timeout, so it's difficult to monitor whether
the branch is healthy or not.

Build timed out (after 255 minutes). Marking the build as aborted.
Build was aborted
...
Finished: ABORTED

-
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/60/console
-
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/47/console

Bests,
Dongjoon.


Schema Evolution in Apache Spark

2018-01-11 Thread Dongjoon Hyun
Hi, All.

A data schema can evolve in several ways, and Apache Spark 2.3 already
supports the following for file-based data sources such as
CSV/JSON/ORC/Parquet.

1. Add a column
2. Remove a column
3. Change a column position
4. Change a column type

Can we guarantee users some schema evolution coverage on file-based data
sources by adding explicit schema evolution test suites? So far, there are
only some test cases.

For simplicity, I make several assumptions about schema evolution.

1. Safe evolution without data loss (see the sketch below),
   e.g. from smaller types to larger types like int-to-long, not vice versa.
2. The final schema is given by users (or Hive).
3. Only simple Spark data types supported by Spark's vectorized execution.
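
To make assumption 1 concrete, here is a minimal sketch (local mode, a
hypothetical /tmp path, and JSON chosen only for simplicity) of the
int-to-long widening case. Whether each file-based source, especially in the
vectorized path, actually handles such reads is exactly the kind of coverage
the proposed suite would pin down.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("evolution-sketch").getOrCreate()
import spark.implicits._

val path = "/tmp/schema-evolution-demo" // hypothetical path

// Data originally written with an int column.
Seq(1, 2, 3).toDF("id").write.mode("overwrite").json(path)

// Read it back with the user-given final schema, widening int to long.
val finalSchema = StructType(Seq(StructField("id", LongType)))
spark.read.schema(finalSchema).json(path).printSchema() // id: long
```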

I made a test case PR to receive your opinions for this.

[SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
data sources
- https://github.com/apache/spark/pull/20208

Could you take a look and give some opinions?

Bests,
Dongjoon.


Re: Branch 2.3 is cut

2018-01-11 Thread Sameer Agarwal
All major blockers have now been resolved with the exception of a couple of
known test issues (SPARK-23020 and SPARK-23000) that are being actively
worked on. Unless there is an objection, I'll shortly follow up
with an RC to get the QA started in parallel.

Thanks,
Sameer

On Mon, Jan 8, 2018 at 5:03 PM, Sameer Agarwal 
wrote:

> Hello everyone,
>
> Just a quick update on the release. There are currently 2 correctness
> blockers (SPARK-22984 and SPARK-22982)
> that are targeted against 2.3.0. We'll go ahead and create an RC as soon as
> they're resolved. All relevant jenkins jobs for the release branch can be
> accessed at: https://amplab.cs.berkeley.edu/jenkins/
>
> Regards,
> Sameer
>
> On Mon, Jan 1, 2018 at 5:22 PM, Sameer Agarwal 
> wrote:
>
>> We've just cut the release branch for Spark 2.3. Committers, please
>> backport all important bug fixes and PRs as appropriate.
>>
>> Next, I'll go ahead and create the jenkins jobs for the release branch
>> and then follow up with an RC early next week.
>>
>
>
>
> --
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
>



-- 
Sameer Agarwal
Software Engineer | Databricks Inc.
http://cs.berkeley.edu/~sameerag


Structured Streaming with S3 file source duplicates data because of eventual consistency

2018-01-11 Thread Yash Sharma
Hi Team,
I have been using Structured Streaming with the S3 file source, but I am
seeing it duplicate data intermittently. A new run seems to fix it, but the
duplication happens about 10% of the time, and the ratio increases with the
number of files in the source. Investigating further, I see this is clearly an
issue with S3's eventual consistency: Spark re-processes a task twice because
it cannot verify that the completed task successfully wrote its output.

I have added all the details of the investigation, along with code and error
logs, in the ticket below. Is there a way we can address this issue, and is
there anything I can help out with?

https://issues.apache.org/jira/browse/SPARK-23050

Cheers


Re: Palantir release under org.apache.spark?

2018-01-11 Thread Prajwal Tuladhar
If you check
https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22,
it only lists the "official" ones.

On Thu, Jan 11, 2018 at 7:59 PM, Steve Loughran 
wrote:

>
>
> On 9 Jan 2018, at 18:10, Sean Owen  wrote:
>
> Just to follow up -- those are actually in a Palantir repo, not Central.
> Deploying to Central would be uncourteous, but this approach is legitimate
> and how it has to work for vendors to release distros of Spark etc.
>
>
> ASF processes are set up to stop people pushing any org.apache. artifact
> out to mvncentral without going through the signing process; if someone
> does, then that's a major problem.
>
> On Tue, Jan 9, 2018 at 11:43 AM Nan Zhu  wrote:
>
>> Hi, all
>>
>> Out of curiosity, I just found a bunch of Palantir releases under
>> org.apache.spark in Maven Central (https://mvnrepository.com/
>> artifact/org.apache.spark/spark-core_2.11)?
>>
>> Is it on purpose?
>>
>> Best,
>>
>> Nan
>>
>>
>>
>


-- 
--
Cheers,
Praj


Re: Palantir release under org.apache.spark?

2018-01-11 Thread Steve Loughran


On 9 Jan 2018, at 18:10, Sean Owen wrote:

Just to follow up -- those are actually in a Palantir repo, not Central. 
Deploying to Central would be uncourteous, but this approach is legitimate and 
how it has to work for vendors to release distros of Spark etc.


ASF processes are set up to stop people pushing any org.apache. artifact out to
mvncentral without going through the signing process; if someone does, then
that's a major problem.

On Tue, Jan 9, 2018 at 11:43 AM Nan Zhu wrote:
Hi, all

Out of curiosity, I just found a bunch of Palantir releases under org.apache.spark
in Maven Central
(https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11)?

Is it on purpose?

Best,

Nan





Re: Publishing container images for Apache Spark

2018-01-11 Thread Craig Russell
Hi,

I think your summary is spot on. I don't see further issues.

Craig

> On Jan 11, 2018, at 9:18 AM, Erik Erlandson  wrote:
> 
> Dear ASF Legal Affairs Committee,
> 
> The Apache Spark development community has begun some discussions about
> publishing container images for Spark as part of its release process.
> These discussions were spurred by the upstream adoption of a new Kubernetes
> scheduling back-end, which by nature operates via container images running
> Spark inside a Kubernetes cluster.
> 
> The current state of thinking on this topic is influenced by the LEGAL-270
> Jira, which can be summarized as:
> * A container image has the same legal status as other derived distributions
> * As such, it is legally sound to publish a container image as long as that 
> image corresponds to an official project release
> * An image that is regularly built from non-release code (e.g. a 
> 'spark:latest' image built from the head of master branch) would not be 
> legally approved
> * The image should not contain any code or binaries that carry GPL licenses, 
> or other licenses considered incompatible with ASF.
> 
> We are reaching out to you to get your additional input on what requirements 
> the community should meet to engineer Apache Spark container images that meet 
> ASF legal guidelines.
> 
> The original dev@spark thread is here:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Publishing-official-docker-images-for-KubernetesSchedulerBackend-td22928.html
> 
> LEGAL-270:
> https://issues.apache.org/jira/browse/LEGAL-270

Craig L Russell
Secretary, Apache Software Foundation
c...@apache.org  http://db.apache.org/jdo 



Publishing container images for Apache Spark

2018-01-11 Thread Erik Erlandson
Dear ASF Legal Affairs Committee,

The Apache Spark development community has begun some discussions about
publishing container images for Spark as part of its release
process.  These discussions were spurred by the upstream adoption of a new
Kubernetes scheduling back-end, which by nature operates via container
images running Spark inside a Kubernetes cluster.

The current state of thinking on this topic is influenced by the LEGAL-270
Jira, which can be
summarized as:
* A container image has the same legal status as other derived distributions
* As such, it is legally sound to publish a container image as long as that
image corresponds to an official project release
* An image that is regularly built from non-release code (e.g. a
'spark:latest' image built from the head of master branch) would not be
legally approved
* The image should not contain any code or binaries that carry GPL
licenses, or other licenses considered incompatible with ASF.

We are reaching out to you to get your additional input on what
requirements the community should meet to engineer Apache Spark container
images that meet ASF legal guidelines.

The original dev@spark thread is here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Publishing-official-docker-images-for-KubernetesSchedulerBackend-td22928.html

LEGAL-270:
https://issues.apache.org/jira/browse/LEGAL-270


Re: Kubernetes: why use init containers?

2018-01-11 Thread Anirudh Ramanathan
If we can separate those concerns out, that might make sense in the short
term, IMO. There are several benefits to reusing spark-submit and spark-class,
as you pointed out previously, so we should be looking to leverage those
irrespective of how we do dependency management, in the interest of
conformance with the other cluster managers.

I like the idea of passing arguments through in a way that doesn't trigger
the dependency management code for now. In the interest of time for 2.3, if we
could target just that (and revisit the init containers afterwards), there
should be enough time to make the change, test, and release with confidence.

On Wed, Jan 10, 2018 at 3:45 PM, Marcelo Vanzin  wrote:

> On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan wrote:
> > We can start by getting a PR going perhaps, and start augmenting the
> > integration testing to ensure that there are no surprises - with/without
> > credentials, accessing GCS, S3 etc as well.
> > When we get enough confidence and test coverage, let's merge this in.
> > Does that sound like a reasonable path forward?
>
> I think it's beneficial to separate this into two separate things as
> far as discussion goes:
>
> - using spark-submit: the code should definitely be starting the
> driver using spark-submit, and potentially the executor using
> spark-class.
>
> - separately, we can decide on whether to keep or remove init containers.
>
> Unfortunately, code-wise, those are not separate. If you get rid of
> init containers, my current p.o.c. has most of the needed changes
> (only lightly tested).
>
> But if you keep init containers, you'll need to mess with the
> configuration so that spark-submit never sees spark.jars /
> spark.files, so it doesn't trigger its dependency download code. (YARN
> does something similar, btw.) That will surely mean different changes
> in the current k8s code (which I wanted to double check anyway because
> I remember seeing some oddities related to those configs in the logs).
>
> To comment on one point made by Andrew:
> > there's almost a parallel here with spark.yarn.archive, where that
> configures the cluster (YARN) to do distribution pre-runtime
>
> That's more of a parallel to the docker image; spark.yarn.archive
> points to a jar file with Spark jars in it so that YARN can make Spark
> available to the driver / executors running in the cluster.
>
> Like the docker image, you could include other stuff that is not
> really part of standard Spark in that archive too, or even not have
> Spark at all there, if you want things to just fail. :-)
>
> --
> Marcelo
>



-- 
Anirudh Ramanathan


Call for Presentations FOSS Backstage open

2018-01-11 Thread Isabel Drost-Fromm
Hi,

As announced at Berlin Buzzwords, we (that is, Isabel Drost-Fromm, Stefan
Rudnitzki, as well as the eventing team over at newthinking communications GmbH)
are working on a new conference this summer in Berlin.
conference will be "FOSS Backstage". Backstage comprises all things
FOSS governance, open collaboration and how to build and manage communities
within the open source space.


Submission URL: https://foss-backstage.de/call-papers 

The event will comprise presentations on all things FOSS governance,
decentralised decision making, and open collaboration. We invite you to submit
talks on topics such as: FOSS project governance, collaboration, community
management, asynchronous/decentralised decision making, vendor neutrality in
FOSS, sustainable FOSS, cross-team collaboration, dealing with poisonous people,
project growth and hand-over, trademarks, and strategic licensing. While it's
primarily targeted at contributions from FOSS people, we would also love to
learn more about how typical FOSS collaboration models work well within
enterprises. Closely related topics not explicitly listed above are welcome.

Important Dates (all dates in GMT +2)

Submission deadline: February 18th, 2018.

Conference: June, 13th/14th, 2018


High-quality talks are called for, ranging from principles to practice. We are
looking for real-world case studies, background on the social architecture of
specific projects, and deep dives into cross-community collaboration.
Acceptance notifications will be sent out soon after the submission deadline.
Please include your name, bio and email, the title of the talk, and a brief
abstract in English.

We have drafted the submission form to allow for regular talks, each 45 minutes
in length. However, you are free to submit your own ideas on how to support the
event: if you would like to take our attendees out to show them your favourite
bar in Berlin, please submit this offer through the CfP form. If you are
interested in sponsoring the event (e.g. we would be happy to provide videos
after the event, free drinks for attendees, as well as an after-show party),
please contact us.

Schedule and further updates on the event will be published soon on the event
web page.

Please re-distribute this CfP to people who might be interested.

 Contact us at:
 newthinking communications GmbH
 Schoenhauser Allee 6/7
 10119 Berlin, Germany
 i...@foss-backstage.de


Looking forward to meeting you all in person in summer :) I would love to see
all those tracks filled with lots of valuable talks on the Apache Way, on how we
work, on how the Incubator works, on how being a 501(c)(3) influences how people
get involved and projects are being run, on how being a member-run organisation
is different, on merit for life, on growing communities, on things gone great -
and things gone entirely wrong in the ASF's history, on how to interact with
Apache projects as a corporation, and everything else you can think of.


Isabel


-- 
Sorry for any typos: Mail was typed in vim, written in mutt, via ssh (most 
likely involving some kind of mobile connection only.)
