Re: Chaining Spark Streaming Jobs

2017-08-23 Thread Michael Armbrust
If you use structured streaming and the file sink, you can have a
subsequent stream read using the file source.  This will maintain
exactly-once processing even if there are hiccups or failures.
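
A minimal sketch of that pattern in Scala; the paths, schema (rawSchema), input
stream (rawLogs) and enrich function below are placeholders, not part of the
original reply:

import org.apache.spark.sql.streaming.Trigger

// Job 1: persist the raw data exactly-once with the file (parquet) sink.
val rawQuery = rawLogs.writeStream
  .format("parquet")
  .option("path", "s3://bucket/raw/")
  .option("checkpointLocation", "s3://bucket/raw-ckpt/")
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()

// Job 2 (a separate application): read the same location back as a stream and enrich it.
val raw = spark.readStream
  .schema(rawSchema)                          // file sources require an explicit schema
  .parquet("s3://bucket/raw/")

val enrichedQuery = enrich(raw).writeStream   // enrich() is a placeholder transformation
  .format("parquet")
  .option("path", "s3://bucket/enriched/")
  .option("checkpointLocation", "s3://bucket/enriched-ckpt/")
  .start()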

On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind 
wrote:

> Hello Spark Experts,
>
> I have a design question w.r.t. Spark Streaming. I have a streaming job
> that consumes protocol-buffer-encoded real-time logs from an on-premise
> Kafka cluster. My Spark application runs on EMR (AWS) and persists data onto
> S3. Before I persist, I need to strip the header and convert the protobuf to
> Parquet (I use sparksql-scalapb to convert from protobuf to
> spark.sql.Row). I need to persist the raw logs as is. I could continue the
> enrichment on the same dataframe after persisting the raw data; however, in
> order to modularize, I am planning to have a separate job which picks up the
> raw data and performs enrichment on it. I am also trying to avoid an
> all-in-one job, as the enrichments could get project specific while raw data
> persistence stays customer/project agnostic. The enriched data is allowed to
> have some latency (a few minutes).
>
> My challenge is, after persisting the raw data, how do I chain the next
> streaming job? The only way I can think of is: job 1 (raw data)
> partitions on the current date (MMDD), and within the current date, job 2
> (the enrichment job) filters for records within 60s of the current time and
> performs enrichment on them in 60s batches.
> Is this a good option? It seems to be error prone. When either of the jobs
> gets delayed due to bursts or any error/exception, this could lead to huge
> data losses and non-deterministic behavior. What are other alternatives to
> this?
>
> Appreciate any guidance in this regard.
>
> regards
> Sunita Koppar
>


Re: Question on how to get appended data from structured streaming

2017-08-20 Thread Michael Armbrust
What is your end goal?  Right now the foreach writer is the way to do
arbitrary processing on the data produced by various output modes.
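
As a minimal sketch (Scala), assuming the Kafka `source` DataFrame from the
question below, a ForeachWriter sees exactly the rows appended by each trigger:

import org.apache.spark.sql.{ForeachWriter, Row}

val query = source.selectExpr("topic", "CAST(value AS STRING) AS value")
  .writeStream
  .outputMode("append")
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true  // open connections here if needed
    def process(row: Row): Unit = println(row)                   // called once per newly appended row
    def close(errorOrNull: Throwable): Unit = ()                 // release resources
  })
  .start()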

On Sun, Aug 20, 2017 at 12:23 PM, Yanpeng Lin  wrote:

> Hello,
>
> I am new to Spark.
> It would be appreciated if anyone could help me understand how to get
> appended data from structured streaming. According to the documentation,
> a data stream can be treated as new rows appended to an unbounded table. I
> want to know: besides writing data out to external storage, is there any
> other way to get only the appended data after each trigger, e.g. directly
> from memory?
>
> Here is my case. I have a Kafka source that keeps publishing data to Spark
> on the `test` topic:
>
> val source = spark.readStream.format("kafka")
>   .option("kafka.bootstrap.servers", "broker:9092")
>   .option("subscribe", "test")
>   .load()
>
> I tried writing the stream with format `memory` like the following:
>
> val query = source.writeStream.format("memory")
>   .trigger(ProcessingTime("3 seconds"))
>   .queryName("tests")
>   .outputMode(OutputMode.Append)
>   .start()
> spark.sql("select topic, value from tests")
> The result table `tests` contains all data from the beginning of the
> stream, like:
>
> Trigger Time   Topic   Value
> t1             test    1
> t1             test    2
> t2             test    3
> t3             test    4
>
> By appended data I mean only the delta data after each trigger. For
> example, after trigger time t1, the rows with values 1 and 2 are newly
> appended. After trigger time t2, the row with value 3 is newly appended,
> and after t3, the row with value 4 could be fetched as newly appended.
> I understand each appended row can be processed using a `ForeachWriter`,
> but if I want to fetch all newly appended data after any trigger time,
> is there any way to do that directly from a dataframe?
>
> Thanks!
> Yanpeng
>


Re: Restart streaming query spark 2.1 structured streaming

2017-08-15 Thread Michael Armbrust
See
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing

Though I think that this currently doesn't work with the console sink.
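
A minimal sketch (Scala) of the checkpointing pattern from that guide, using the
dataDF from the question below and a durable sink (paths are placeholders; as
noted, the console sink can't recover):

val query = dataDF.writeStream
  .format("parquet")
  .option("path", "testOut/")
  .option("checkpointLocation", "testCkpt/")
  .outputMode("append")
  .start()

// To refresh a cached DataFrame: stop the query, rebuild the cache, then start a new
// query with the same checkpointLocation; it resumes from the recorded offsets.
query.stop()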

On Tue, Aug 15, 2017 at 9:40 AM, purna pradeep 
wrote:

> Hi,
>
>>
>> I'm trying to restart a streaming query to refresh a cached data frame.
>>
>> Where and how should I restart the streaming query?
>>
>
>
> val sparkSes = SparkSession
>   .builder
>   .config("spark.master", "local")
>   .appName("StreamingCachePoc")
>   .getOrCreate()
>
> import sparkSes.implicits._
>
> val dataDF = sparkSes.readStream
>   .schema(streamSchema)
>   .csv("testData")
>
> val query = counts.writeStream
>   .outputMode("complete")
>   .format("console")
>   .start()
>
> query.awaitTermination()


Re: [SS] watermark, eventTime and "StreamExecution: Streaming query made progress"

2017-08-11 Thread Michael Armbrust
The point here is to tell you what watermark value was used when executing
this batch.  You don't know the new watermark until the batch is over and
we don't want to do two passes over the data.  In general the semantics of
the watermark are designed to be conservative (i.e. just because data is
older than the watermark does not mean it will be dropped, but data will
never be dropped until after it is below the watermark).
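
As a small illustration (Scala), the watermark that was used for the batch that
just finished is reported through the query's progress; for a running
StreamingQuery `query`:

val progress = query.lastProgress               // StreamingQueryProgress of the last batch (null before the first one)
println(progress.eventTime.get("watermark"))    // the "watermark" entry shown in the JSON below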

On Fri, Aug 11, 2017 at 12:23 AM, Jacek Laskowski  wrote:

> Hi,
>
> I'm curious why the watermark is updated in the next streaming batch after
> it's been observed [1]. The report (from
> ProgressReporter/StreamExecution) does not look right to me, as
> avg/max/min are already calculated according to the watermark [2].
>
> My recommendation would be to do the update [2] in the same streaming
> batch it was observed in. Why not? Please enlighten me.
>
> 17/08/11 09:04:20 INFO StreamExecution: Streaming query made progress: {
>   "id" : "ec8f8228-90f6-4e1f-8ad2-80222affed63",
>   "runId" : "f605c134-cfb0-4378-88c1-159d8a7c232e",
>   "name" : "rates-to-console",
>   "timestamp" : "2017-08-11T07:04:20.004Z",
>   "batchId" : 1,
>   "numInputRows" : 2,
>   "inputRowsPerSecond" : 0.7601672367920943,
>   "processedRowsPerSecond" : 25.31645569620253,
>   "durationMs" : {
> "addBatch" : 48,
> "getBatch" : 6,
> "getOffset" : 0,
> "queryPlanning" : 1,
> "triggerExecution" : 79,
> "walCommit" : 23
>   },
>   "eventTime" : {
> "avg" : "2017-08-11T07:04:17.782Z",
> "max" : "2017-08-11T07:04:18.282Z",
> "min" : "2017-08-11T07:04:17.282Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   },
>
> ...
>
> 17/08/11 09:04:30 INFO StreamExecution: Streaming query made progress: {
>   "id" : "ec8f8228-90f6-4e1f-8ad2-80222affed63",
>   "runId" : "f605c134-cfb0-4378-88c1-159d8a7c232e",
>   "name" : "rates-to-console",
>   "timestamp" : "2017-08-11T07:04:30.003Z",
>   "batchId" : 2,
>   "numInputRows" : 10,
>   "inputRowsPerSecond" : 1.000100010001,
>   "processedRowsPerSecond" : 56.17977528089888,
>   "durationMs" : {
> "addBatch" : 147,
> "getBatch" : 6,
> "getOffset" : 0,
> "queryPlanning" : 1,
> "triggerExecution" : 178,
> "walCommit" : 22
>   },
>   "eventTime" : {
> "avg" : "2017-08-11T07:04:23.782Z",
> "max" : "2017-08-11T07:04:28.282Z",
> "min" : "2017-08-11T07:04:19.282Z",
> "watermark" : "2017-08-11T07:04:08.282Z"
>   },
>
> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala?utf8=%E2%9C%93#L538
> [2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L257
>
> Regards,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>


Re: Question about 'Structured Streaming'

2017-08-08 Thread Michael Armbrust
>
> 1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header
> that contains the 'schema' for the data; each log (http/dns/etc.) will have
> different columns with different data types. So would I create a specific
> CSV reader inherited from the general one?  Also I'm assuming this would
> need to be in Scala/Java? (I suck at both of those :)
>

This is a good question. What I have seen others do is actually run
different streams for the different log types.  This way you can customize
the schema to the specific log type.

Even without using Scala/Java you could also use the text data source
(assuming the logs are new line delimited) and then write the parser for
each line in python.  There will be a performance penalty here though.
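
The parsing itself could be done in Python as suggested; a minimal sketch of the
same idea in Scala (the path and field positions are assumptions about the Bro
log layout):

import org.apache.spark.sql.functions.{col, split}

val lines = spark.readStream.text("logs/http/")       // one string column named "value"
val parsed = lines
  .where(!col("value").startsWith("#"))               // drop the '#'-prefixed header lines
  .select(split(col("value"), "\t").as("fields"))
  .select(col("fields").getItem(0).as("ts"),
          col("fields").getItem(2).as("id_orig_h"))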


> 2) Dynamic Tailing: Does the CSV/TSV data sources support dynamic tailing
> and handle log rotations?
>

The file based sources work by tracking which files have been processed and
then scanning (optionally using glob patterns) for new files.  There are two
assumptions here: files are immutable when they arrive and files always
have a unique name. If files are deleted, we ignore that, so you are okay
to rotate them out.

The full pipeline that I have seen often involves the logs getting uploaded
to something like S3.  This is nice because you get atomic visibility of
files that have already been rotated.  So I wouldn't really call this
dynamically tailing, but we do support looking for new files at some
location.


Re: Question about 'Structured Streaming'

2017-08-08 Thread Michael Armbrust
Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to
read bro logs, rather than a python library.  This is likely to have much
better performance since we can do all of the parsing on the JVM without
having to flow it through an external python process.
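
A minimal sketch of that pattern (Scala) for tab-separated Bro logs; the schema
and path are assumptions and would differ per log type:

import org.apache.spark.sql.types._

val connSchema = new StructType()
  .add("ts", DoubleType)
  .add("uid", StringType)
  .add("id_orig_h", StringType)

val conn = spark.readStream
  .schema(connSchema)
  .option("sep", "\t")
  .option("comment", "#")      // skip the '#'-prefixed header lines
  .csv("logs/conn/")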

On Tue, Aug 8, 2017 at 9:35 AM, Brian Wylie  wrote:

> Hi All,
>
> I've read the new information about Structured Streaming in Spark, looks
> super great.
>
> Resources that I've looked at
> - https://spark.apache.org/docs/latest/streaming-programming-guide.html
> - https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
> - https://spark.apache.org/docs/latest/streaming-custom-receivers.html
> - http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Structured%20Streaming%20using%20Python%20DataFrames%20API.html
>
> + YouTube videos from Spark Summit 2016/2017
>
> So finally getting to my question:
>
> I have Python code that yields a Python generator... this is a great
> streaming approach within Python. I've used it for network packet
> processing and a bunch of other stuff. I'd love to simply hook up this
> generator (that yields python dictionaries) along with a schema definition
> to create an 'unbounded DataFrame' as discussed in
> https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>
> Possible approaches:
> - Make a custom receiver in Python: https://spark.apache.org/docs/latest/streaming-custom-receivers.html
> - Use Kafka (this is definitely possible and good but overkill for my use
> case)
> - Send data out a socket and use socketTextStream to pull back in (seems a
> bit silly to me)
> - Other???
>
> Since Python generators so naturally fit into streaming pipelines, I'd
> think it would be straightforward to 'couple' a python generator
> into a Spark structured streaming pipeline.
>
> I've put together a small notebook just to give a concrete example
> (streaming Bro IDS network data) https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb
>
> Any thoughts/suggestions/pointers are greatly appreciated.
>
> -Brian
>
>


Re: [SS] Console sink not supporting recovering from checkpoint location? Why?

2017-08-07 Thread Michael Armbrust
I think there is really no good reason for this limitation.

On Mon, Aug 7, 2017 at 2:58 AM, Jacek Laskowski  wrote:

> Hi,
>
> While exploring checkpointing with the kafka source and console sink I
> got the following exception:
>
> // today's build from the master
> scala> spark.version
> res8: String = 2.3.0-SNAPSHOT
>
> scala> val q = records.
>  |   writeStream.
>  |   format("console").
>  |   option("truncate", false).
>  |   option("checkpointLocation", "/tmp/checkpoint"). // <--
> checkpoint directory
>  |   trigger(Trigger.ProcessingTime(10.seconds)).
>  |   outputMode(OutputMode.Update).
>  |   start
> org.apache.spark.sql.AnalysisException: This query does not support
> recovering from checkpoint location. Delete /tmp/checkpoint/offsets to
> start over.;
>   at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(
> StreamingQueryManager.scala:222)
>   at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(
> StreamingQueryManager.scala:278)
>   at org.apache.spark.sql.streaming.DataStreamWriter.
> start(DataStreamWriter.scala:284)
>   ... 61 elided
>
> The "trigger" is the change
> https://issues.apache.org/jira/browse/SPARK-16116 and this line in
> particular:
> https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277.
>
> Why is this needed? I can't think of a use case where console sink
> could not recover from checkpoint location (since all the information
> is available). I'm lost on it and would appreciate some help (to
> recover :))
>
> Regards,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
>


Re: Thoughts on release cadence?

2017-07-31 Thread Michael Armbrust
+1, should we update https://spark.apache.org/versioning-policy.html ?

On Sun, Jul 30, 2017 at 3:34 PM, Reynold Xin  wrote:

> This is reasonable ... +1
>
>
> On Sun, Jul 30, 2017 at 2:19 AM, Sean Owen  wrote:
>
>> The project had traditionally posted some guidance about upcoming
>> releases. The last release cycle was about 6 months. What about penciling
>> in December 2017 for 2.3.0? http://spark.apache.org/versioning-policy.html
>>
>
>


Re: how to convert the binary from Kafka to string, please

2017-07-24 Thread Michael Armbrust
There are end-to-end examples of using Kafka in this blog post:
https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
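
That post boils down to casting the Kafka key/value bytes; a minimal sketch
(Scala) with placeholder broker and topic values:

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "test")
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()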

On Sun, Jul 23, 2017 at 7:44 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all
>
> I want to convert the binary from kafka to a string. Would you help me,
> please?
>
> val df = ss.readStream.format("kafka")
>   .option("kafka.bootstrap.servers", "")
>   .option("subscribe", "")
>   .load
>
> val value = df.select("value")
>
> value.writeStream
> .outputMode("append")
> .format("console")
> .start()
> .awaitTermination()
>
>
> The above code outputs a result like:
>
> +-------+
> |  value|
> +-------+
> |[61,61]|
> +-------+
>
> 61 is the character 'a' received from kafka.
> I want to print [a,a] or aa.
> How should I do this, please?
>


Re: custom joins on dataframe

2017-07-23 Thread Michael Armbrust
>
> left.join(right, my_fuzzy_udf (left("cola"),right("cola")))
>

While this could work, the problem will be that we'll have to check every
possible combination of tuples from left and right using your UDF.  It
would be best if you could somehow partition the problem so that we could
reduce the number of comparisons.  For example, if you had a fuzzy hash
that you could do an equality check on in addition to the UDF, that would
greatly speed up the computation.
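
A minimal sketch (Scala) of that suggestion, assuming my_fuzzy_udf is the UDF
from the question and fuzzyHash is a hypothetical UDF that buckets similar
values, both created with org.apache.spark.sql.functions.udf:

// The equality on the hash lets Spark plan an equi-join; the fuzzy UDF then
// only filters within each hash bucket instead of every (left, right) pair.
val joined = left.join(
  right,
  fuzzyHash(left("cola")) === fuzzyHash(right("cola")) &&
    my_fuzzy_udf(left("cola"), right("cola")))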


Re: Flatten JSON to multiple columns in Spark

2017-07-18 Thread Michael Armbrust
Here is an overview of how to work with complex JSON in Spark:
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
(works in streaming and batch)
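
A minimal sketch (Scala) of the approach in that post, with an assumed column
name and a deliberately truncated schema (extend it to the fields you need; if
the column holds a JSON array rather than a single object, wrap the schema in an
ArrayType and explode the result):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val infoSchema = new StructType()
  .add("expdCnt", StringType)
  .add("mfgAcctNum", StringType)
  .add("pgmUUID", StringType)

val flattened = df
  .withColumn("info", from_json(col("Info"), infoSchema))
  .select(col("Title"), col("ISBN"),
          col("info.expdCnt"), col("info.mfgAcctNum"), col("info.pgmUUID"))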

On Tue, Jul 18, 2017 at 10:29 AM, Riccardo Ferrari 
wrote:

> What's against:
>
> df.rdd.map(...)
>
> or
>
> dataset.foreach()
>
> https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.sql.Dataset@foreach(f:T=>Unit):Unit
>
> Best,
>
> On Tue, Jul 18, 2017 at 6:46 PM, lucas.g...@gmail.com <
> lucas.g...@gmail.com> wrote:
>
>> I've been wondering about this for a while.
>>
>> We wanted to do something similar for generically saving thousands of
>> individual homogeneous events into well-formed parquet.
>>
>> Ultimately I couldn't find something I wanted to own and pushed back on
>> the requirements.
>>
>> It seems the canonical answer is that you need to 'own' the schema of the
>> json and parse it out manually and into your dataframe.  There's nothing
>> challenging about it, just verbose code.  If your 'info' is a consistent
>> schema then you'll be fine.  For us it was 12 wildly diverging schemas and
>> I didn't want to own the transforms.
>>
>> I also recommend persisting anything that isn't part of your schema in an
>> 'extras' field. So when you parse out your json, if you've got anything
>> left over, drop it in there for later analysis.
>>
>> I can provide some sample code but I think it's pretty straightforward /
>> you can google it.
>>
>> What you can't seem to do efficiently is dynamically generate a dataframe
>> from random JSON.
>>
>>
>> On 18 July 2017 at 01:57, Chetan Khatri 
>> wrote:
>>
>>> I tried the implicits - it didn't work!
>>>
>>> from_json isn't supported in spark 2.0.1; any alternative solution would
>>> be welcome, please.
>>>
>>>
>>> On Tue, Jul 18, 2017 at 12:18 PM, Georg Heiler <
>>> georg.kf.hei...@gmail.com> wrote:
>>>
 You need to have spark implicits in scope
 Richard Xin wrote on Tue, Jul 18, 2017 at 08:45:

> I believe you could use JOLT (bazaarvoice/jolt, a JSON-to-JSON
> transformation library written in Java) to flatten it to a json string
> and then to a dataframe or dataset.
>
>
>
>
> On Monday, July 17, 2017, 11:18:24 PM PDT, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>
> Explode is not working in this scenario; the error says a string cannot be
> used in explode, which expects either an array or a map in spark.
> On Tue, Jul 18, 2017 at 11:39 AM, 刘虓  wrote:
>
> Hi,
> have you tried to use explode?
>
> Chetan Khatri wrote on Tue, Jul 18, 2017 at 2:06 PM:
>
> Hello Spark Dev's,
>
> Can you please guide me, how to flatten JSON to multiple columns in
> Spark.
>
> *Example:*
>
> Sr No Title ISBN Info
> 1 Calculus Theory 1234567890
>
> [{"cert":[{
>   "authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa",
>   "certUUID":"03ea5a1a-5530-4fa3-8871-9d1ebac627c4",
>   "effDt":"2016-05-06T15:04:56.279Z",
>   "fileFmt":"rjrCsv",
>   "status":"live"}],
>  "expdCnt":"15",
>  "mfgAcctNum":"531093",
>  "oUUID":"23d07397-4fbe-4897-8a18-b79c9f64726c",
>  "pgmRole":["RETAILER"],
>  "pgmUUID":"1cb5dd63-817a-45bc-a15c-5660e4accd63",
>  "regUUID":"cc1bd898-657d-40dc-af5d-4bf1569a1cc4",
>  "rtlrsSbmtd":["009415da-c8cd-418d-869e-0a19601d79fa"]}]
>
> I want to get a single row with 11 columns.
>
> Thanks.
>
>
>>>
>>
>


[ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Michael Armbrust
Hi all,

Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release
removes the experimental tag from Structured Streaming. In addition, this
release focuses on usability, stability, and polish, resolving over 1100
tickets.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 2.2.0, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes: https://spark.apache.org/releases/spark-release-2-2-0.html

*(note: If you see any issues with the release notes, webpage or published
artifacts, please contact me directly off-list) *

Michael


Re: Event time aggregation is possible in Spark Streaming ?

2017-07-10 Thread Michael Armbrust
Event-time aggregation is only supported in Structured Streaming.
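
A minimal sketch (Scala) of event-time aggregation in Structured Streaming,
assuming a streaming DataFrame `events` with a timestamp column named "eventTime":

import org.apache.spark.sql.functions.{col, window}

val counts = events
  .withWatermark("eventTime", "10 minutes")           // tolerate 10 minutes of late data
  .groupBy(window(col("eventTime"), "5 minutes"))     // aggregate by event time, not arrival time
  .count()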

On Sat, Jul 8, 2017 at 4:18 AM, Swapnil Chougule 
wrote:

> Hello,
>
> I want to know whether event-time aggregation is possible in spark
> streaming. I can see it's possible in structured streaming. As I am working
> with conventional spark streaming, I need event-time aggregation in it. I
> checked but didn't find any relevant documentation.
>
> Thanks in advance
>
> Regards,
> Swapnil
>


Re: Union of 2 streaming data frames

2017-07-10 Thread Michael Armbrust
As I said in the voting thread:

This vote passes! I'll followup with the release on Monday.



On Mon, Jul 10, 2017 at 10:55 AM, Lalwani, Jayesh <
jayesh.lalw...@capitalone.com> wrote:

> Michael,
>
>
>
> I see that 2.2 RC6 has passed a vote on Friday. Does this mean 2.2 is
> going to be out soon? Do you have some sort of ETA?
>
>
>
> *From: *"Lalwani, Jayesh" <jayesh.lalw...@capitalone.com>
> *Date: *Friday, July 7, 2017 at 5:46 PM
> *To: *Michael Armbrust <mich...@databricks.com>
>
> *Cc: *"user@spark.apache.org" <user@spark.apache.org>, #MM - Heartbeat <
> mm-heartb...@capitalone.com>
> *Subject: *Re: Union of 2 streaming data frames
>
>
>
> Great! Even val dfAllEvents =
> sparkSession.table("oldEvents").union(sparkSession.table("newEvents"))
> doesn't work. Will this be addressed in 2.2?
>
>
>
>
>
> *From: *Michael Armbrust <mich...@databricks.com>
> *Date: *Friday, July 7, 2017 at 5:42 PM
> *To: *"Lalwani, Jayesh" <jayesh.lalw...@capitalone.com>
> *Cc: *"user@spark.apache.org" <user@spark.apache.org>, #MM - Heartbeat <
> mm-heartb...@capitalone.com>
> *Subject: *Re: Union of 2 streaming data frames
>
>
>
> Ah, looks like you are hitting SPARK-20441
> <https://issues.apache.org/jira/browse/SPARK-20441>.  Should be fixed in
> 2.2.
>
>
>
> On Fri, Jul 7, 2017 at 2:37 PM, Lalwani, Jayesh <
> jayesh.lalw...@capitalone.com> wrote:
>
> I created a small sample code to verify this. It looks like union using
> Spark SQL doesn’t work. Calling union on dataframe works.
> https://gist.github.com/GaalDornick/8920577ca92842f44d7bfd3a277c7545. I’m
> on 2.1.0
>
>
>
> I get the following exception. If I change val dfAllEvents =
> sparkSession.sql("select * from oldEvents union select * from newEvents")
> to val dfAllEvents = dfNewEvents.union(dfOldEvents) it works fine
>
>
>
> 17/07/07 17:33:34 ERROR StreamExecution: Query [id =
> 3bae26a1-7ee3-45ab-a98d-9346eaf03d08, runId = 
> 063af01f-9878-452e-aa30-7c21e2ef4c18]
> terminated with error
>
> org.apache.spark.sql.AnalysisException: resolved attribute(s) acctId#29
> missing from 
> eventType#2,acctId#0,eventId#37L,acctId#36,eventType#38,eventId#1L
> in operator !Join Inner, (acctId#0 = acctId#29);;
>
> Distinct
>
> +- Union
>
>:- Project [acctId#0, eventId#1L, eventType#2]
>
>:  +- SubqueryAlias oldevents, `oldEvents`
>
>: +- Project [acctId#0, eventId#1L, eventType#2]
>
>   :+- !Join Inner, (acctId#0 = acctId#29)
>
>:   :- SubqueryAlias alloldevents, `allOldEvents`
>
>:   :  +- Relation[acctId#0,eventId#1L,eventType#2] json
>
>:   +- SubqueryAlias newevents, `newEvents`
>
>:  +- Relation[acctId#36,eventId#37L,eventType#38] json
>
>+- Project [acctId#29, eventId#30L, eventType#31]
>
>   +- SubqueryAlias newevents, `newEvents`
>
>  +- Relation[acctId#29,eventId#30L,eventType#31] json
>
>
>
> at org.apache.spark.sql.catalyst.
> analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>
> at org.apache.spark.sql.catalyst.analysis.Analyzer.
> failAnalysis(Analyzer.scala:57)
>
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:337)
>
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(
> TreeNode.scala:128)
>
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> foreachUp$1.apply(TreeNode.scala:127)
>
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> foreachUp$1.apply(TreeNode.scala:127)
>
> at scala.collection.immutable.List.foreach(List.scala:381)
>
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(
> TreeNode.scala:127)
>
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> foreachUp$1.apply(TreeNode.scala:127)
>
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> foreachUp$1.apply(TreeNode.scala:127)
>
> at scala.collection.immutable.List.foreach(List.scala:381)
>
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(
> TreeNode.scala:127)
>
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> foreachUp$1.apply(TreeNode

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-07 Thread Michael Armbrust
This vote passes! I'll followup with the release on Monday.

+1:
Michael Armbrust (binding)
Kazuaki Ishizaki
Sean Owen (binding)
Joseph Bradley (binding)
Ricardo Almeida
Herman van Hövell tot Westerflier (binding)
Yanbo Liang
Nick Pentreath (binding)
Wenchen Fan (binding)
Sameer Agarwal
Denny Lee
Felix Cheung
Holden Karau
Dong Joon Hyun
Reynold Xin (binding)
Hyukjin Kwon
Yin Huai (binding)
Xiao Li

-1: None

On Fri, Jul 7, 2017 at 12:21 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> +1
>
> Xiao Li
>
> 2017-07-06 22:18 GMT-07:00 Yin Huai <yh...@databricks.com>:
>
>> +1
>>
>> On Thu, Jul 6, 2017 at 8:40 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> +1
>>>
>>> 2017-07-07 6:41 GMT+09:00 Reynold Xin <r...@databricks.com>:
>>>
>>>> +1
>>>>
>>>>
>>>> On Fri, Jun 30, 2017 at 6:44 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.2.0-rc6
>>>>> <https://github.com/apache/spark/tree/v2.2.0-rc6> (a2c7b2133cfee7f
>>>>> a9abfaa2bfbfb637155466783)
>>>>>
>>>>> List of JIRA tickets resolved can be found with this filter
>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>>>>> .
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapache
>>>>> spark-1245/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://people.apache.org/~pwendell/spark-releases/spark-2.2
>>>>> .0-rc6-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>>
>>>>> *But my bug isn't fixed!??!*
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from 2.1.1.
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Union of 2 streaming data frames

2017-07-07 Thread Michael Armbrust
alysis.Analyzer.
> checkAnalysis(Analyzer.scala:57)
>
> at org.apache.spark.sql.execution.QueryExecution.
> assertAnalyzed(QueryExecution.scala:48)
>
> at org.apache.spark.sql.execution.QueryExecution.
> withCachedData$lzycompute(QueryExecution.scala:68)
>
> at org.apache.spark.sql.execution.QueryExecution.
> withCachedData(QueryExecution.scala:67)
>
> at org.apache.spark.sql.execution.streaming.
> IncrementalExecution.optimizedPlan$lzycompute(
> IncrementalExecution.scala:60)
>
> at org.apache.spark.sql.execution.streaming.
> IncrementalExecution.optimizedPlan(IncrementalExecution.scala:60)
>
> at org.apache.spark.sql.execution.QueryExecution.
> sparkPlan$lzycompute(QueryExecution.scala:79)
>
> at org.apache.spark.sql.execution.QueryExecution.
> sparkPlan(QueryExecution.scala:75)
>
> at org.apache.spark.sql.execution.QueryExecution.
> executedPlan$lzycompute(QueryExecution.scala:84)
>
> at org.apache.spark.sql.execution.QueryExecution.
> executedPlan(QueryExecution.scala:84)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution$$anonfun$org$apache$spark$sql$execution$
> streaming$StreamExecution$$runBatch$3.apply(StreamExecution.scala:496)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution$$anonfun$org$apache$spark$sql$execution$
> streaming$StreamExecution$$runBatch$3.apply(StreamExecution.scala:488)
>
> at org.apache.spark.sql.execution.streaming.
> ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution.reportTimeTaken(StreamExecution.scala:46)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution.org$apache$spark$sql$execution$streaming$
> StreamExecution$$runBatch(StreamExecution.scala:488)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution$$anonfun$org$apache$spark$sql$execution$
> streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$
> mcV$sp(StreamExecution.scala:255)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution$$anonfun$org$apache$spark$sql$execution$
> streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(
> StreamExecution.scala:244)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution$$anonfun$org$apache$spark$sql$execution$
> streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(
> StreamExecution.scala:244)
>
> at org.apache.spark.sql.execution.streaming.
> ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution.reportTimeTaken(StreamExecution.scala:46)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution$$anonfun$org$apache$spark$sql$execution$
> streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(
> StreamExecution.scala:244)
>
> at org.apache.spark.sql.execution.streaming.
> ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution.org$apache$spark$sql$execution$streaming$
> StreamExecution$$runBatches(StreamExecution.scala:239)
>
> at org.apache.spark.sql.execution.streaming.
> StreamExecution$$anon$1.run(StreamExecution.scala:177)
>
>
>
>
>
>
>
>
>
> *From: *Michael Armbrust <mich...@databricks.com>
> *Date: *Friday, July 7, 2017 at 2:30 PM
> *To: *"Lalwani, Jayesh" <jayesh.lalw...@capitalone.com>
> *Cc: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Re: Union of 2 streaming data frames
>
>
>
> df.union(df2) should be supported when both DataFrames are created from a
> streaming source.  What error are you seeing?
>
>
>
> On Fri, Jul 7, 2017 at 11:27 AM, Lalwani, Jayesh <
> jayesh.lalw...@capitalone.com> wrote:
>
> In structured streaming, Is there a way to Union 2 streaming data frames?
> Are there any plans to support Union of 2 streaming dataframes soon? I can
> understand the inherent complexity in joining 2 streaming data frames. But,
> Union is  just concatenating 2 microbatches, innit?
>
>
>
> The problem that we are trying to solve is that we have a Kafka stream
> that is receiving events. Each event is assosciated with an account ID. We
> have a data store that stores historical  events for hundreds of millions
> of accounts. What we want to do is for the events coming in the input
> stre

[jira] [Updated] (SPARK-20441) Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation

2017-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20441:
-
Affects Version/s: (was: 2.2.0)

> Within the same streaming query, one StreamingRelation should only be 
> transformed to one StreamingExecutionRelation
> ---
>
> Key: SPARK-20441
> URL: https://issues.apache.org/jira/browse/SPARK-20441
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Liwei Lin
> Fix For: 2.2.0
>
>
> Within the same streaming query, when one StreamingRelation is referred 
> multiple times -- e.g. df.union(df) -- we should transform it only to one 
> StreamingExecutionRelation, instead of two or more different  
> StreamingExecutionRelations.






[jira] [Updated] (SPARK-20441) Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation

2017-07-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20441:
-
Fix Version/s: 2.2.0

> Within the same streaming query, one StreamingRelation should only be 
> transformed to one StreamingExecutionRelation
> ---
>
> Key: SPARK-20441
> URL: https://issues.apache.org/jira/browse/SPARK-20441
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Liwei Lin
> Fix For: 2.2.0
>
>
> Within the same streaming query, when one StreamingRelation is referred 
> multiple times -- e.g. df.union(df) -- we should transform it only to one 
> StreamingExecutionRelation, instead of two or more different  
> StreamingExecutionRelations.






Re: Union of 2 streaming data frames

2017-07-07 Thread Michael Armbrust
df.union(df2) should be supported when both DataFrames are created from a
streaming source.  What error are you seeing?
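
A minimal sketch (Scala) of the supported pattern, with placeholder paths and an
assumed eventSchema; both inputs are streaming DataFrames:

val oldEvents = spark.readStream.schema(eventSchema).json("oldEvents/")
val newEvents = spark.readStream.schema(eventSchema).json("newEvents/")

// union works as long as the two schemas line up column-for-column
val allEvents = oldEvents.union(newEvents)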

On Fri, Jul 7, 2017 at 11:27 AM, Lalwani, Jayesh <
jayesh.lalw...@capitalone.com> wrote:

> In structured streaming, Is there a way to Union 2 streaming data frames?
> Are there any plans to support Union of 2 streaming dataframes soon? I can
> understand the inherent complexity in joining 2 streaming data frames. But,
> Union is  just concatenating 2 microbatches, innit?
>
>
>
> The problem that we are trying to solve is that we have a Kafka stream
> that is receiving events. Each event is assosciated with an account ID. We
> have a data store that stores historical  events for hundreds of millions
> of accounts. What we want to do is for the events coming in the input
> stream, we want to add in all the historical events from the data store and
> give it to a model.
>
>
>
> Initially, the way we were planning to do this is
> a) read from Kafka into a streaming dataframe. Call this inputDF.
> b) In a mapWithPartition method, get all the unique accounts in the
> partition. Look up all the historical events for those unique accounts and
> return them. Let’s call this historicalDF
>
> c) Union inputDF with historicalDF. Call this allDF
>
> d) Call mapWithPartition on allDF and give the records to the model
>
>
>
> Of course, this doesn’t work because both inputDF and historicalDF are
> streaming data frames.
>
>
>
> What we ended up doing is in step b) we output the input records with the
> historical records, which works but seems like a hacky way of doing things.
> The operation that does lookup does union too. This works for now because
> the data from the data store doesn’t require any transformation or
> aggregation. But, if it did, we would like to do that using Spark SQL,
> whereas this solution forces us to doing any transformation of historical
> data in Scala
>
>
>
> Is there a Sparky way of doing this?
>
>
> --
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>


Re: If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread Michael Armbrust
It goes through the same optimization pipeline.  More in this video.
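
One way to see this for yourself (Scala; the table name is a placeholder): both
forms produce equivalent optimized plans, which explain(true) prints.

spark.sql("SELECT id, count(*) FROM events GROUP BY id").explain(true)
spark.table("events").groupBy("id").count().explain(true)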

On Thu, Jul 6, 2017 at 5:28 PM, kant kodali  wrote:

> HI All,
>
> I am wondering If I pass a raw SQL string to dataframe do I still get the
> Spark SQL optimizations? why or why not?
>
> Thanks!
>


[jira] [Updated] (SPARK-21267) Improvements to the Structured Streaming programming guide

2017-07-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-21267:
-
Target Version/s:   (was: 2.2.0)

> Improvements to the Structured Streaming programming guide
> --
>
> Key: SPARK-21267
> URL: https://issues.apache.org/jira/browse/SPARK-21267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> - Add information for Ganglia
> - Add Kafka Sink to the main docs
> - Move Structured Streaming above Spark Streaming






Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-06-30 Thread Michael Armbrust
I'll kick off the vote with a +1.

On Fri, Jun 30, 2017 at 6:44 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc6
> <https://github.com/apache/spark/tree/v2.2.0-rc6> (a2c7b2133cfee7f
> a9abfaa2bfbfb637155466783)
>
> List of JIRA tickets resolved can be found with this filter
> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
> .
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1245/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


[VOTE] Apache Spark 2.2.0 (RC6)

2017-06-30 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.2.0-rc6
 (
a2c7b2133cfee7fa9abfaa2bfbfb637155466783)

List of JIRA tickets resolved can be found with this filter

.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1245/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-06-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070525#comment-16070525
 ] 

Michael Armbrust commented on SPARK-18057:
--

We should upgrade.  Now that Kafka has a good protocol versioning story, I also 
wonder if we should get rid of the version in our artifacts entirely.  When we 
upgrade it would also be good if we can add the new headers to the row that we 
output.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






Re: Interesting Stateful Streaming question

2017-06-30 Thread Michael Armbrust
This does sound like a good use case for that feature.  Note that Spark
2.2 adds a similar [flat]MapGroupsWithState operation to structured
streaming.  Stay tuned for a blog post on that!
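
A minimal sketch (Scala, Spark 2.2) of that operation, assuming a streaming
Dataset[Event] named `events` and `import spark.implicits._` in scope; it tracks
which states each message id has visited and emits the id once s1, s2 and s3
have all been seen:

import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

case class Event(id: String, state: String)

val completed = events
  .groupByKey(_.id)
  .flatMapGroupsWithState[Seq[String], String](
      OutputMode.Update(), GroupStateTimeout.NoTimeout()) { (id, batch, groupState) =>
    // merge the states seen in this batch into the per-key state
    val visited = (groupState.getOption.getOrElse(Seq.empty[String]) ++ batch.map(_.state)).distinct
    groupState.update(visited)
    if (Set("s1", "s2", "s3").subsetOf(visited.toSet)) Iterator(id) else Iterator.empty
  }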

On Thu, Jun 29, 2017 at 6:11 PM, kant kodali  wrote:

> Is mapWithState an answer for this?
> https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html
>
> On Thu, Jun 29, 2017 at 11:55 AM, kant kodali  wrote:
>
>> Hi All,
>>
>> Here is a problem and I am wondering if Spark Streaming is the right tool
>> for this ?
>>
>> I have a stream of messages m1, m2, m3, ... and each of those messages can
>> be in state s1, s2, s3, ..., sn (you can imagine the number of states is
>> about 100), and I want to compute some metrics over messages that visit all
>> the states from s1 to sn, but these state transitions can happen over an
>> indefinite amount of time. A simple example of that would be counting all
>> messages that visited states s1, s2 and s3. In other words, the transition
>> function should know that, say, message m1 had visited states s1 and s2 but
>> not s3 yet, and once message m1 visits s3, increment the counter += 1.
>>
>> If it makes anything easier I can say a message has to visit s1 before
>> visiting s2 and s2 before visiting s3 and so on but would like to know both
>> with and without order.
>>
>> Thanks!
>>
>>
>


[jira] [Commented] (SPARK-15533) Deprecate Dataset.explode

2017-06-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070308#comment-16070308
 ] 

Michael Armbrust commented on SPARK-15533:
--

Just include the other columns too {{df.select($"id", explode($"val"))}}

> Deprecate Dataset.explode
> -
>
> Key: SPARK-15533
> URL: https://issues.apache.org/jira/browse/SPARK-15533
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> See discussion on the mailing list: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201605.mbox/browser
> We should deprecate Dataset.explode, and point users to Dataset.flatMap and 
> functions.explode with select.






[jira] [Updated] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-21253:
-
Target Version/s: 2.2.0

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93

[jira] [Assigned] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-21253:


Assignee: Shixiong Zhu

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> Spark *cluster* can reproduce, *local* can't:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-26 Thread Michael Armbrust
Okay, this vote fails.  Following with RC6 shortly.

On Wed, Jun 21, 2017 at 12:51 PM, Imran Rashid <iras...@cloudera.com> wrote:

> -1
>
> I'm sorry for discovering this so late, but I just filed
> https://issues.apache.org/jira/browse/SPARK-21165 which I think should be
> a blocker, its a regression from 2.1
>
> On Wed, Jun 21, 2017 at 1:43 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> As before, release looks good, all Scala, Python tests pass. R tests fail
>> with same issue in SPARK-21093 but it's not a blocker.
>>
>> +1 (binding)
>>
>>
>> On Wed, 21 Jun 2017 at 01:49 Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> I will kick off the voting with a +1.
>>>
>>> On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00
>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>>
>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>
>>>> The tag to be voted on is v2.2.0-rc5
>>>> <https://github.com/apache/spark/tree/v2.2.0-rc5> (62e442e73a2fa66
>>>> 3892d2edaff5f7d72d7f402ed)
>>>>
>>>> List of JIRA tickets resolved can be found with this filter
>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>>>> .
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1243/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://people.apache.org/~pwendell/spark-releases/spark-2.
>>>> 2.0-rc5-docs/
>>>>
>>>>
>>>> *FAQ*
>>>>
>>>> *How can I help test this release?*
>>>>
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>
>>>> *But my bug isn't fixed!??!*
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from 2.1.1.
>>>>
>>>
>>>
>


[jira] [Commented] (SPARK-21110) Structs should be usable in inequality filters

2017-06-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061398#comment-16061398
 ] 

Michael Armbrust commented on SPARK-21110:
--

It seems that if you can call {{min}} and {{max}} on structs, you should be able to 
use comparison operations as well.
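
For illustration, a small Scala sketch of that asymmetry (the sample data is made up): {{max}} over a struct column is accepted today, while the analogous inequality filter is rejected by the analyzer.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, struct}

val spark = SparkSession.builder().master("local[*]").appName("struct-ordering").getOrCreate()
import spark.implicits._

val people = Seq(("Boston", "Bob"), ("San Francisco", "Nick")).toDF("city", "person")

// Aggregating with max over a struct works, so an ordering on structs already exists.
people.agg(max(struct(col("city"), col("person")))).show()

// ...but the analogous filter currently fails analysis with the error quoted below:
// people.as("a").crossJoin(people.as("b"))
//   .where(struct(col("a.city"), col("a.person")) < struct(col("b.city"), col("b.person")))
//   .show()
{code}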

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct<city:string,person:string>;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21110) Structs should be usable in inequality filters

2017-06-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-21110:
-
Target Version/s: 2.3.0

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct<city:string,person:string>;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Michael Armbrust
rk/tree/v2.2.0-rc4,
>>>>>>
>>>>>> 1. Windows Server 2012 R2 / R 3.3.1 - passed (
>>>>>> https://ci.appveyor.com/project/spark-test/spark/
>>>>>> build/755-r-test-v2.2.0-rc4)
>>>>>> 2. macOS Sierra 10.12.3 / R 3.4.0 - passed
>>>>>> 3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
>>>>>> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>>>>>> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (https://gist.github.com/
>>>>>> HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)
>>>>>>
>>>>>>
>>>>>> Per https://github.com/apache/spark/tree/v2.1.1,
>>>>>>
>>>>>> 1. CentOS 7.2.1511 / R 3.4.0 - reproduced (https://gist.github.com/
>>>>>> HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)
>>>>>>
>>>>>>
>>>>>> This looks being failed only in CentOS 7.2.1511 / R 3.4.0 given my
>>>>>> tests and observations.
>>>>>>
>>>>>> This is failed in Spark 2.1.1. So, it sounds not a regression
>>>>>> although it is a bug that should be fixed (whether in Spark or R).
>>>>>>
>>>>>>
>>>>>> 2017-06-14 8:28 GMT+09:00 Xiao Li <gatorsm...@gmail.com>:
>>>>>>
>>>>>>> -1
>>>>>>>
>>>>>>> Spark 2.2 is unable to read the partitioned table created by Spark
>>>>>>> 2.1 or earlier.
>>>>>>>
>>>>>>> Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085
>>>>>>>
>>>>>>> Will fix it soon.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Xiao Li
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2017-06-13 9:39 GMT-07:00 Joseph Bradley <jos...@databricks.com>:
>>>>>>>
>>>>>>>> Re: the QA JIRAs:
>>>>>>>> Thanks for discussing them.  I still feel they are very helpful; I
>>>>>>>> particularly notice not having to spend a solid 2-3 weeks of time QAing
>>>>>>>> (unlike in earlier Spark releases).  One other point not mentioned 
>>>>>>>> above: I
>>>>>>>> think they serve as a very helpful reminder/training for the community 
>>>>>>>> for
>>>>>>>> rigor in development.  Since we instituted QA JIRAs, contributors have 
>>>>>>>> been
>>>>>>>> a lot better about adding in docs early, rather than waiting until the 
>>>>>>>> end
>>>>>>>> of the cycle (though I know this is drawing conclusions from 
>>>>>>>> correlations).
>>>>>>>>
>>>>>>>> I would vote in favor of the RC...but I'll wait to see about the
>>>>>>>> reported failures.
>>>>>>>>
>>>>>>>> On Fri, Jun 9, 2017 at 3:30 PM, Sean Owen <so...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Different errors as in https://issues.apache.org/
>>>>>>>>> jira/browse/SPARK-20520 but that's also reporting R test
>>>>>>>>> failures.
>>>>>>>>>
>>>>>>>>> I went back and tried to run the R tests and they passed, at least
>>>>>>>>> on Ubuntu 17 / R 3.3.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jun 9, 2017 at 9:12 AM Nick Pentreath <
>>>>>>>>> nick.pentre...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> All Scala, Python tests pass. ML QA and doc issues are resolved
>>>>>>>>>> (as well as R it seems).
>>>>>>>>>>
>>>>>>>>>> However, I'm seeing the following test failure on R consistently:
>>>>>>>>>> https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, 8 Jun 2017 at 08:48 Denny Lee <denny.g@gmail.com>
>>>>>>>>>&

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-20 Thread Michael Armbrust
I will kick off the voting with a +1.

On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc5
> <https://github.com/apache/spark/tree/v2.2.0-rc5> (62e442e73a2fa66
> 3892d2edaff5f7d72d7f402ed)
>
> List of JIRA tickets resolved can be found with this filter
> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
> .
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1243/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


[VOTE] Apache Spark 2.2.0 (RC5)

2017-06-20 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.2.0-rc5
<https://github.com/apache/spark/tree/v2.2.0-rc5> (
62e442e73a2fa663892d2edaff5f7d72d7f402ed)

List of JIRA tickets resolved can be found with this filter
<https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1243/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.
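
One lightweight way to do this, assuming an sbt-based workload (a sketch, not an official procedure), is to point a test build at the staging repository listed above and re-run an existing job against it:

// build.sbt (sketch): resolve the RC artifacts from the staging repository above
resolvers += "Apache Spark 2.2.0 RC5 staging" at "https://repository.apache.org/content/repositories/orgapachespark-1243/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql"  % "2.2.0"
)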

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


Re: org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Michael Armbrust
It's in the spark-catalyst_2.11-2.1.1.jar since the logical query plans and
optimization also need to know about types.
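
For anyone checking their build, a quick sketch (sbt/Scala, not from the original thread) showing that depending on spark-sql alone is enough, since spark-catalyst comes in transitively:

// build.sbt (sketch): spark-sql pulls in spark-catalyst transitively,
// which is the jar that actually contains org.apache.spark.sql.types.
// libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.1"

import org.apache.spark.sql.types.{DataTypes, IntegerType, StructField, StructType}

object TypesSmokeTest {
  def main(args: Array[String]): Unit = {
    // These classes live in spark-catalyst, but resolve via the transitive dependency.
    val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
    println(schema.simpleString)                   // struct<id:int>
    println(DataTypes.IntegerType == IntegerType)  // true
  }
}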

On Tue, Jun 20, 2017 at 1:14 PM, Jean Georges Perrin  wrote:

> Hey all,
>
> i was giving a run to 2.1.1 and got an error on one of my test program:
>
> package net.jgp.labs.spark.l000_ingestion;
>
> import java.util.Arrays;
> import java.util.List;
>
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.types.IntegerType;
>
> public class ArrayToDataset {
>
> public static void main(String[] args) {
> ArrayToDataset app = new ArrayToDataset();
> app.start();
> }
>
> private void start() {
> SparkSession spark = SparkSession.builder().appName("Array to Dataset"
> ).master("local").getOrCreate();
>
> Integer[] l = new Integer[] { 1, 2, 3, 4, 5, 6, 7 };
> List<Integer> data = Arrays.asList(l);
> Dataset<Row> df = spark.createDataFrame(data, IntegerType.class);
>
> df.show();
> }
> }
>
> Eclipse is complaining that it cannot find 
> org.apache.spark.sql.types.IntegerType
> and after looking in the spark-sql_2.11-2.1.1.jar jar, I could not find it
> as well:
>
> I looked at the 2.1.1 release notes as well, did not see anything. The
> package is still in Javadoc: https://spark.apache.
> org/docs/latest/api/java/org/apache/spark/sql/types/package-summary.html
>
> I must be missing something. Any hint?
>
> Thanks!
>
> jg
>
>
>
>
>
>


Re: how many topics spark streaming can handle

2017-06-19 Thread Michael Armbrust
I don't think that there is really a Spark specific limit here.  It would
be a function of the size of your Spark / Kafka clusters and the type of
processing you are trying to do.
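
For what it's worth, a sketch (Structured Streaming's Kafka source; broker and topic names are placeholders) of consuming several topics, or a whole topic pattern, in a single query:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-topic").getOrCreate()

// Several explicit topics in one source...
val multi = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribe", "topic1,topic2,topic3")
  .load()

// ...or every topic matching a pattern.
val byPattern = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribePattern", "logs-.*")
  .load()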

On Mon, Jun 19, 2017 at 12:00 PM, Ashok Kumar 
wrote:

> Hi Gurus,
>
> Within one Spark streaming process how many topics can be handled? I have
> not tried more than one topic.
>
> Thanks
>


Re: cannot call explain or show on dataframe in structured streaming addBatch dataframe

2017-06-19 Thread Michael Armbrust
There is a little bit of weirdness to how we override the default query
planner to replace it with an incrementalizing planner.  As such, calling
any operation that changes the query plan (such as a LIMIT) would cause it
to revert to the batch planner and return the wrong answer.  We should fix
this before we finalize the Sink API.
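
To make the pattern concrete, here is a rough sketch (illustrative only; the class name is made up, and Sink is an internal, still-evolving API) of a custom sink that copies ConsoleSink's collect-and-recreate approach so that no new planning happens on the incremental DataFrame:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Sketch of a sink mirroring ConsoleSink: collect the batch, rebuild a plain
// DataFrame from the rows, and only then run batch-style operations on it.
class PrintingSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    val rows = data.collect()  // does not alter the incremental query plan
    val batch = data.sparkSession.createDataFrame(
      data.sparkSession.sparkContext.parallelize(rows), data.schema)
    batch.show(20, false)
    // By contrast, something like data.limit(20).show() would change the plan
    // and could fall back to the batch planner with wrong results, per above.
  }
}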

On Mon, Jun 19, 2017 at 9:32 AM, assaf.mendelson 
wrote:

> Hi all,
>
> I am playing around with structured streaming and looked at the code for
> ConsoleSink.
>
>
>
> I see the code has:
>
>
>
> data.sparkSession.createDataFrame(
> data.sparkSession.sparkContext.parallelize(data.collect()), data.schema)
> .show(*numRowsToShow*, *isTruncated*)
> }
>
>
>
> I was wondering why it does not use data directly? Why the collect and
> parallelize?
>
>
>
>
>
> Thanks,
>
>   Assaf.
>
>
>
> --
> View this message in context: cannot call explain or show on dataframe in
> structured streaming addBatch dataframe
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


Re: the scheme in stream reader

2017-06-19 Thread Michael Armbrust
The socket source can't know how to parse your data.  I think the right
thing would be for it to throw an exception saying that you can't set the
schema here.  Would you mind opening a JIRA ticket?

If you are trying to parse data from something like JSON then you should
use `from_json` on the value returned.
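
As a concrete sketch (the name/age fields just mirror the example below), parsing JSON lines from the socket source with from_json instead of setting a schema on the reader would look roughly like:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("socket-json").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

// The socket source always produces a single string column named "value".
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Parse each line as JSON and flatten the resulting struct.
val parsed = lines
  .select(from_json($"value", schema).as("data"))
  .select("data.*")

parsed.printSchema()  // name: string, age: integer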

On Sun, Jun 18, 2017 at 12:27 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> I set the schema for DataStreamReader, but when I print the schema it just
> printed:
> root
> |--value:string (nullable=true)
>
> My code is
>
> val line = ss.readStream.format("socket")
> .option("ip",xxx)
> .option("port",xxx)
> .scheme(StructField("name",StringType)::(StructField("age",
> IntegerType))).load
> line.printSchema
>
> My spark version is 2.1.0.
> I want printSchema to print the schema I set in the code. How should I do
> that, please?
> And my original goal is for the data received from the socket to be handled
> with that schema directly. What should I do, please?
>
> thanks
> Fei Shao
>
>
>
>
>
>
>


[jira] [Updated] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-21133:
-
Target Version/s: 2.2.0
Priority: Blocker  (was: Major)
 Description: 
To reproduce, set {{spark.sql.shuffle.partitions}} to a value greater than 2000 in a 
query with a shuffle; for a simple example:

{code:sql}
spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
  set spark.sql.shuffle.partitions=2001;
  drop table if exists spark_hcms_npe;
  create table spark_hcms_npe as select id, count(*) from big_table group by id;
"
{code}

Error logs:
{noformat}
17/06/18 15:00:27 ERROR Utils: Exception encountered
java.lang.NullPointerException
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
at 
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
at 
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
... 17 

[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-06-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054596#comment-16054596
 ] 

Michael Armbrust commented on SPARK-20928:
--

Hi Cody, I do plan to flesh this out with the other sections of the SIP 
document and will email the dev list at that point.  All that has been done so 
far is some basic prototyping to estimate how much work an alternative 
{{StreamExecution}} would take to build, and some experiments to validate the 
latencies that this arch could achieve.  Do you have specific concerns with the 
proposal as it stands?

> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when it needs to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
You might also try with a newer version.  Several instances of code
generation failures have been fixed since 2.0.
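
If it helps with reproducing or bisecting, a standalone sketch of the shape described below (column count and nesting depth are arbitrary, chosen only to inflate the generated code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[*]").appName("nested-struct").getOrCreate()

// A wide-ish row to amplify the size of the generated code.
val df = spark.range(10).select((0 until 50).map(i => (col("id") + i).as(s"c$i")): _*)

// Nest struct() several levels deep, the pattern described in this thread.
val nested = (1 to 5).foldLeft(df.select(struct(df.columns.map(col): _*).as("s0"))) {
  (acc, level) => acc.select(struct(col(s"s${level - 1}")).as(s"s$level"))
}

// On affected 2.0.x builds the generated method can exceed the JVM's 64 KB limit;
// execution then falls back to the non-codegen path (slower, but not incorrect).
nested.explain()
nested.show(5, truncate = false)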

On Thu, Jun 15, 2017 at 1:15 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi Michael,
> Spark 2.0.2 - but I have a very interesting test case actually
> The optimiser seems to be at fault in a way; I've attached to this email the
> explain output when I limit myself to 2 levels of struct mutation and when it
> goes to 5.
> As you can see, the optimiser seems to be doing a lot more in the latter
> case.
> After further investigation, the code is not "failing" per se - spark is
> trying the whole stage codegen, the compilation is failing due to the
> compilation error and I think it's falling back to the "non codegen" way.
>
> I'll try to create a simpler test case to reproduce this if I can, what do
> you think ?
>
> Regards,
>
> Olivier.
>
>
> 2017-06-15 21:08 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>
>> Which version of Spark?  If it's recent I'd open a JIRA.
>>
>> On Thu, Jun 15, 2017 at 6:04 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> when we create recursive calls to "struct" (up to 5 levels) for
>>> extending a complex datastructure we end up with the following compilation
>>> error :
>>>
>>> org.codehaus.janino.JaninoRuntimeException: Code of method
>>> "(I[Lscala/collection/Iterator;)V" of class
>>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator"
>>> grows beyond 64 KB
>>>
>>> The CreateStruct code itself is properly using the ctx.splitExpression
>>> command but the "end result" of the df.select( struct(struct(struct()
>>> ))) ends up being too much.
>>>
>>> Should I open a JIRA or is there a workaround ?
>>>
>>> Regards,
>>>
>>> --
>>> *Olivier Girardot* | Associé
>>> o.girar...@lateral-thoughts.com
>>>
>>
>>
>
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>


Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
You might also try with a newer version.  Several instance of code
generation failures have been fixed since 2.0.

On Thu, Jun 15, 2017 at 1:15 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi Michael,
> Spark 2.0.2 - but I have a very interesting test case actually
> The optimiser seems to be at fault in a way, I've joined to this email the
> explain when I limit myself to 2 levels of struct mutation and when it goes
> to 5.
> As you can see the optimiser seems to be doing a lot more in the later
> case.
> After further investigation, the code is not "failing" per se - spark is
> trying the whole stage codegen, the compilation is failing due to the
> compilation error and I think it's falling back to the "non codegen" way.
>
> I'll try to create a simpler test case to reproduce this if I can, what do
> you think ?
>
> Regards,
>
> Olivier.
>
>
> 2017-06-15 21:08 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>
>> Which version of Spark?  If its recent I'd open a JIRA.
>>
>> On Thu, Jun 15, 2017 at 6:04 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> when we create recursive calls to "struct" (up to 5 levels) for
>>> extending a complex datastructure we end up with the following compilation
>>> error :
>>>
>>> org.codehaus.janino.JaninoRuntimeException: Code of method
>>> "(I[Lscala/collection/Iterator;)V" of class
>>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator"
>>> grows beyond 64 KB
>>>
>>> The CreateStruct code itself is properly using the ctx.splitExpression
>>> command but the "end result" of the df.select( struct(struct(struct()
>>> ))) ends up being too much.
>>>
>>> Should I open a JIRA or is there a workaround ?
>>>
>>> Regards,
>>>
>>> --
>>> *Olivier Girardot* | Associé
>>> o.girar...@lateral-thoughts.com
>>>
>>
>>
>
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>


Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
Which version of Spark?  If it's recent I'd open a JIRA.

On Thu, Jun 15, 2017 at 6:04 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> when we create recursive calls to "struct" (up to 5 levels) for extending
> a complex datastructure we end up with the following compilation error :
>
> org.codehaus.janino.JaninoRuntimeException: Code of method
> "(I[Lscala/collection/Iterator;)V" of class "org.apache.spark.sql.
> catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
>
> The CreateStruct code itself is properly using the ctx.splitExpression
> command but the "end result" of the df.select( struct(struct(struct()
> ))) ends up being too much.
>
> Should I open a JIRA or is there a workaround ?
>
> Regards,
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
>


Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
Which version of Spark?  If its recent I'd open a JIRA.

On Thu, Jun 15, 2017 at 6:04 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> when we create recursive calls to "struct" (up to 5 levels) for extending
> a complex datastructure we end up with the following compilation error :
>
> org.codehaus.janino.JaninoRuntimeException: Code of method
> "(I[Lscala/collection/Iterator;)V" of class "org.apache.spark.sql.
> catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
>
> The CreateStruct code itself is properly using the ctx.splitExpression
> command but the "end result" of the df.select( struct(struct(struct()
> ))) ends up being too much.
>
> Should I open a JIRA or is there a workaround ?
>
> Regards,
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
>


Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-15 Thread Michael Armbrust
Continuous processing is still a work in progress.  I would really like to
at least have a basic version in Spark 2.3.

The announcement about 2.2 is that we are planning to remove the
experimental tag from Structured Streaming.

On Thu, Jun 15, 2017 at 11:53 AM, kant kodali <kanth...@gmail.com> wrote:

> vow! you caught the 007!  Is continuous processing mode available in 2.2?
> The ticket says the target version is 2.3 but the talk in the Video says
> 2.2 and beyond so I am just curious if it is available in 2.2 or should I
> try it from the latest build?
>
> Thanks!
>
> On Wed, Jun 14, 2017 at 5:32 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> This a good question. I really like using Kafka as a centralized source
>> for streaming data in an organization and, with Spark 2.2, we have full
>> support for reading and writing data to/from Kafka in both streaming and
>> batch
>> <https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html>.
>> I'll focus here on what I think the advantages are of Structured Streaming
>> over Kafka Streams (a stream processing library that reads from Kafka).
>>
>>  - *High level productive APIs* - Streaming queries in Spark can be
>> expressed using DataFrames, Datasets or even plain SQL.  Streaming
>> DataFrames/SQL are supported in Scala, Java, Python and even R.  This means
>> that for common operations like filtering, joining, aggregating, you can
>> use built-in operations.  For complicated custom logic you can use UDFs and
>> lambda functions. In contrast, Kafka Streams mostly requires you to express
>> your transformations using lambda functions.
>>  - *High Performance* - Since it is built on Spark SQL, streaming
>> queries take advantage of the Catalyst optimizer and the Tungsten execution
>> engine. This design leads to huge performance wins
>> <https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html>,
>> which means you need less hardware to accomplish the same job.
>>  - *Ecosystem* - Spark has connectors for working with all kinds of data
>> stored in a variety of systems.  This means you can join a stream with data
>> encoded in parquet and stored in S3/HDFS.  Perhaps more importantly, it
>> also means that if you decide that you don't want to manage a Kafka cluster
>> anymore and would rather use Kinesis, you can do that too.  We recently
>> moved a bunch of our pipelines from Kafka to Kinesis and had to only change
>> a few lines of code! I think its likely that in the future Spark will also
>> have connectors for Google's PubSub and Azure's streaming offerings.
>>
>> Regarding latency, there has been a lot of discussion about the inherent
>> latencies of micro-batch.  Fortunately, we were very careful to leave
>> batching out of the user facing API, and as we demo'ed last week, this
>> makes it possible for the Spark Streaming to achieve sub-millisecond
>> latencies <https://www.youtube.com/watch?v=qAZ5XUz32yM>.  Watch
>> SPARK-20928 <https://issues.apache.org/jira/browse/SPARK-20928> for more
>> on this effort to eliminate micro-batch from Spark's execution model.
>>
>> At the far other end of the latency spectrum...  For those with jobs that
>> run in the cloud on data that arrives sporadically, you can run streaming
>> jobs that only execute every few hours or every few days, shutting the
>> cluster down in between.  This architecture can result in a huge cost
>> savings for some applications
>> <https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html>
>> .
>>
>> Michael
>>
>> On Sun, Jun 11, 2017 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am trying hard to figure out what is the real difference between Kafka
>>> Streaming vs Spark Streaming other than saying one can be used as part of
>>> Micro services (since Kafka streaming is just a library) and the other is a
>>> Standalone framework by itself.
>>>
>>> If I can accomplish same job one way or other this is a sort of a
>>> puzzling question for me so it would be great to know what Spark streaming
>>> can do that Kafka Streaming cannot do efficiently or whatever ?
>>>
>>> Thanks!
>>>
>>>
>>
>


Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-14 Thread Michael Armbrust
This is a good question. I really like using Kafka as a centralized source for
streaming data in an organization and, with Spark 2.2, we have full support
for reading and writing data to/from Kafka in both streaming and batch
<https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html>.
I'll focus here on what I think the advantages are of Structured Streaming
over Kafka Streams (a stream processing library that reads from Kafka).

 - *High level productive APIs* - Streaming queries in Spark can be
expressed using DataFrames, Datasets or even plain SQL.  Streaming
DataFrames/SQL are supported in Scala, Java, Python and even R.  This means
that for common operations like filtering, joining, and aggregating, you can
use built-in operations.  For complicated custom logic you can use UDFs and
lambda functions. In contrast, Kafka Streams mostly requires you to express
your transformations using lambda functions.
 - *High Performance* - Since it is built on Spark SQL, streaming queries
take advantage of the Catalyst optimizer and the Tungsten execution engine.
This design leads to huge performance wins
<https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html>,
which means you need less hardware to accomplish the same job.
 - *Ecosystem* - Spark has connectors for working with all kinds of data
stored in a variety of systems.  This means you can join a stream with data
encoded in parquet and stored in S3/HDFS.  Perhaps more importantly, it
also means that if you decide that you don't want to manage a Kafka cluster
anymore and would rather use Kinesis, you can do that too.  We recently
moved a bunch of our pipelines from Kafka to Kinesis and had to change only
a few lines of code! I think it's likely that in the future Spark will also
have connectors for Google's PubSub and Azure's streaming offerings.

Regarding latency, there has been a lot of discussion about the inherent
latencies of micro-batch.  Fortunately, we were very careful to leave
batching out of the user-facing API, and as we demo'ed last week, this
makes it possible for Spark Streaming to achieve sub-millisecond
latencies <https://www.youtube.com/watch?v=qAZ5XUz32yM>.  Watch SPARK-20928
<https://issues.apache.org/jira/browse/SPARK-20928> for more on this effort
to eliminate micro-batch from Spark's execution model.

At the far other end of the latency spectrum...  For those with jobs that
run in the cloud on data that arrives sporadically, you can run streaming
jobs that only execute every few hours or every few days, shutting the
cluster down in between.  This architecture can result in huge cost
savings for some applications
<https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html>.
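
As a small illustration of the Kafka read/write support mentioned above (a sketch; broker, topic names, and checkpoint path are placeholders), a single query can read one topic and write its output back to another:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

val spark = SparkSession.builder().appName("kafka-roundtrip").getOrCreate()

val in = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Kafka rows expose key/value as binary; transform and write back as strings.
val out = in
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .withColumn("value", upper(col("value")))

val query = out.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "events-enriched")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-roundtrip")
  .start()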

Michael

On Sun, Jun 11, 2017 at 1:12 AM, kant kodali  wrote:

> Hi All,
>
> I am trying hard to figure out what is the real difference between Kafka
> Streaming vs Spark Streaming other than saying one can be used as part of
> Micro services (since Kafka streaming is just a library) and the other is a
> Standalone framework by itself.
>
> If I can accomplish same job one way or other this is a sort of a puzzling
> question for me so it would be great to know what Spark streaming can do
> that Kafka Streaming cannot do efficiently or whatever ?
>
> Thanks!
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-14 Thread Michael Armbrust
ub.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>>>>> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (
>>>>> https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)
>>>>>
>>>>>
>>>>> Per https://github.com/apache/spark/tree/v2.1.1,
>>>>>
>>>>> 1. CentOS 7.2.1511 / R 3.4.0 - reproduced (
>>>>> https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)
>>>>>
>>>>>
>>>>> This looks being failed only in CentOS 7.2.1511 / R 3.4.0 given my
>>>>> tests and observations.
>>>>>
>>>>> This is failed in Spark 2.1.1. So, it sounds not a regression although
>>>>> it is a bug that should be fixed (whether in Spark or R).
>>>>>
>>>>>
>>>>> 2017-06-14 8:28 GMT+09:00 Xiao Li <gatorsm...@gmail.com>:
>>>>>
>>>>>> -1
>>>>>>
>>>>>> Spark 2.2 is unable to read the partitioned table created by Spark
>>>>>> 2.1 or earlier.
>>>>>>
>>>>>> Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085
>>>>>>
>>>>>> Will fix it soon.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Xiao Li
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2017-06-13 9:39 GMT-07:00 Joseph Bradley <jos...@databricks.com>:
>>>>>>
>>>>>>> Re: the QA JIRAs:
>>>>>>> Thanks for discussing them.  I still feel they are very helpful; I
>>>>>>> particularly notice not having to spend a solid 2-3 weeks of time QAing
>>>>>>> (unlike in earlier Spark releases).  One other point not mentioned 
>>>>>>> above: I
>>>>>>> think they serve as a very helpful reminder/training for the community 
>>>>>>> for
>>>>>>> rigor in development.  Since we instituted QA JIRAs, contributors have 
>>>>>>> been
>>>>>>> a lot better about adding in docs early, rather than waiting until the 
>>>>>>> end
>>>>>>> of the cycle (though I know this is drawing conclusions from 
>>>>>>> correlations).
>>>>>>>
>>>>>>> I would vote in favor of the RC...but I'll wait to see about the
>>>>>>> reported failures.
>>>>>>>
>>>>>>> On Fri, Jun 9, 2017 at 3:30 PM, Sean Owen <so...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Different errors as in https://issues.apache.org/j
>>>>>>>> ira/browse/SPARK-20520 but that's also reporting R test failures.
>>>>>>>>
>>>>>>>> I went back and tried to run the R tests and they passed, at least
>>>>>>>> on Ubuntu 17 / R 3.3.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 9, 2017 at 9:12 AM Nick Pentreath <
>>>>>>>> nick.pentre...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> All Scala, Python tests pass. ML QA and doc issues are resolved
>>>>>>>>> (as well as R it seems).
>>>>>>>>>
>>>>>>>>> However, I'm seeing the following test failure on R consistently:
>>>>>>>>> https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 8 Jun 2017 at 08:48 Denny Lee <denny.g@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 non-binding
>>>>>>>>>>
>>>>>>>>>> Tested on macOS Sierra, Ubuntu 16.04
>>>>>>>>>> test suite includes various test cases including Spark SQL, ML,
>>>>>>>>>> GraphFrames, Structured Streaming
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 7, 2017 at 9:40 PM vaquar khan <vaquar.k...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 non-bi

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Michael Armbrust
Apologies for messing up the https urls.  My mistake.  I'll try to get it
right next time.

Regarding the readiness of this and previous RCs: I did cut RC1 & RC2
knowing that they were unlikely to pass.  That said, I still think these
early RCs are valuable. I know of several users who wanted to test new
features in 2.2 and have used them.  Now, if we would prefer to call them
preview or RC0 or something, I'd be okay with that as well.

Regarding doc updates, I don't think it is a requirement that they be voted
on as part of the release, even if they are version specific.  I
think we have regularly updated the website with documentation that was
merged after the release.

I personally don't think the QA umbrella JIRAs are particularly effective,
but I also wouldn't ban their use if others think they are.  However, I do
think that real QA needs an RC to test, so I think it is fine that there is
still outstanding QA to be done when an RC is cut.  For example, I plan to
run a bunch of streaming workloads on RC4 and will vote accordingly.

TL;DR: Based on what I have heard from everyone so far, there are currently
no known issues that should fail the vote here.  We should begin testing
RC4.  Thanks to everyone for your help!

On Mon, Jun 5, 2017 at 1:20 PM, Sean Owen <so...@cloudera.com> wrote:

> (I apologize for going on about this, but I've asked ~4 times: could you
> make the URLs here in the form email HTTPS URLs? It sounds minor, but we're
> asking people to verify the integrity of software and hashes, and this is
> the one case where it is actually important.)
>
> The "2.2" JIRAs don't look like updates to the non-version-specific web
> pages. If they affect release docs (i.e. under spark.apache.org/docs/),
> or the code, those QA/doc updates have to happen before a release. Right? I
> feel like this is self-evident but this comes up every minor release, that
> some testing or doc changes for a release can happen after the code and
> docs for the release are finalized. They obviously can't.
>
> I know, I get it. I think the reality is that the reporters don't believe
> there is something must-do for the 2.2.0 release, or else they'd have
> spoken up. In that case, these should be closed already as they're
> semantically "Blockers" and we shouldn't make an RC that can't pass.
>
> ... or should we? Actually, to me the idea of an "RC0" release as a
> preview, and RCs that are known to fail for testing purposes seem OK. But
> if that's the purpose here, let's say it.
>
> If the "QA" JIRAs just represent that 'we will test things, in general',
> then I think they're superfluous at best. These aren't used consistently,
> and their intent isn't actionable (i.e. it sounds like no particular
> testing resolves the JIRA). They signal something that doesn't seem to
> match the intent.
>
> Can we close the QA JIRAs -- and are there any actual must-have docs not
> already in the 2.2 branch?
>
> On Mon, Jun 5, 2017 at 8:52 PM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> I commented on that JIRA, I don't think that should block the release.
>> We can support both options long term if this vote passes.  Looks like the
>> remaining JIRAs are doc/website updates that can happen after the vote or
>> QA that should be done on this RC.  I think we are ready to start testing
>> this release seriously!
>>
>> On Mon, Jun 5, 2017 at 12:40 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Xiao opened a blocker on 2.2.0 this morning:
>>>
>>> SPARK-20980 Rename the option `wholeFile` to `multiLine` for JSON and CSV
>>>
>>> I don't see that this should block?
>>>
>>> We still have 7 Critical issues:
>>>
>>> SPARK-20520 R streaming tests failed on Windows
>>> SPARK-20512 SparkR 2.2 QA: Programming guide, migration guide, vignettes
>>> updates
>>> SPARK-20499 Spark MLlib, GraphX 2.2 QA umbrella
>>> SPARK-20508 Spark R 2.2 QA umbrella
>>> SPARK-20513 Update SparkR website for 2.2
>>> SPARK-20510 SparkR 2.2 QA: Update user guide for new features & APIs
>>> SPARK-20507 Update MLlib, GraphX websites for 2.2
>>>
>>> I'm going to assume that the R test issue isn't actually that big a
>>> deal, and that the 2.2 items are done. Anything that really is for 2.2
>>> needs to block the release; Joseph what's the status on those?
>>>
>>> On Mon, Jun 5, 2017 at 8:15 PM Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00
>>>

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Michael Armbrust
I commented on that JIRA, I don't think that should block the release.  We
can support both options long term if this vote passes.  Looks like the
remaining JIRAs are doc/website updates that can happen after the vote or
QA that should be done on this RC.  I think we are ready to start testing
this release seriously!

On Mon, Jun 5, 2017 at 12:40 PM, Sean Owen <so...@cloudera.com> wrote:

> Xiao opened a blocker on 2.2.0 this morning:
>
> SPARK-20980 Rename the option `wholeFile` to `multiLine` for JSON and CSV
>
> I don't see that this should block?
>
> We still have 7 Critical issues:
>
> SPARK-20520 R streaming tests failed on Windows
> SPARK-20512 SparkR 2.2 QA: Programming guide, migration guide, vignettes
> updates
> SPARK-20499 Spark MLlib, GraphX 2.2 QA umbrella
> SPARK-20508 Spark R 2.2 QA umbrella
> SPARK-20513 Update SparkR website for 2.2
> SPARK-20510 SparkR 2.2 QA: Update user guide for new features & APIs
> SPARK-20507 Update MLlib, GraphX websites for 2.2
>
> I'm going to assume that the R test issue isn't actually that big a deal,
> and that the 2.2 items are done. Anything that really is for 2.2 needs to
> block the release; Joseph what's the status on those?
>
> On Mon, Jun 5, 2017 at 8:15 PM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc4
>> <https://github.com/apache/spark/tree/v2.2.0-rc4> (377cfa8ac7ff7a8
>> a6a6d273182e18ea7dc25ce7e)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1241/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>


[VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc4
<https://github.com/apache/spark/tree/v2.2.0-rc4> (
377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)

List of JIRA tickets resolved can be found with this filter
<https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1241/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


[jira] [Commented] (SPARK-20980) Rename the option `wholeFile` to `multiLine` for JSON and CSV

2017-06-05 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16037416#comment-16037416
 ] 

Michael Armbrust commented on SPARK-20980:
--

I already cut RC4; I think we may just need to accept both options moving 
forward.
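
For reference, a sketch of what callers would write once the rename lands (paths are placeholders):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multiline-read").getOrCreate()

// JSON records that span multiple lines (e.g. pretty-printed documents).
val json = spark.read.option("multiLine", "true").json("/data/pretty-printed.json")

// CSV with quoted fields that contain embedded newlines.
val csv = spark.read
  .option("multiLine", "true")
  .option("header", "true")
  .csv("/data/embedded-newlines.csv")
{code}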

> Rename the option `wholeFile` to `multiLine` for JSON and CSV
> -
>
> Key: SPARK-20980
> URL: https://issues.apache.org/jira/browse/SPARK-20980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> The current option name `wholeFile` is misleading for CSV: it does not mean one 
> record per file, since a single file can contain multiple records. Thus, we should 
> rename it; the proposal is `multiLine`.
> To keep the options consistent, we need to rename the same option for JSON and fix 
> that issue in another JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Michael Armbrust
This should probably fail the vote.  I'll follow up with an RC4.

On Fri, Jun 2, 2017 at 4:11 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> I'm -1 on this.
>
> I merged a PR <https://github.com/apache/spark/pull/18172> to master/2.2
> today and broke the build. I'm really sorry for the trouble, and I should
> not have been so aggressive when merging PRs. The actual cause was some misleading
> comments in the code and a bug in Spark's testing framework: it never
> runs REPL tests unless you change code in the REPL module.
>
> I will be more careful in the future, and should NEVER backport
> non-bug-fix commits to an RC branch. Sorry again for the trouble!
>
> On Fri, Jun 2, 2017 at 2:40 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Tues, June 6th, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc3
>> <https://github.com/apache/spark/tree/v2.2.0-rc3> (cc5dbd55b0b312a
>> 661d21a4b605ce5ead2ba5218)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1239/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>
>


[jira] [Closed] (SPARK-20737) Mechanism for cleanup hooks, for structured-streaming sinks on executor shutdown.

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-20737.

Resolution: Won't Fix

> Mechanism for cleanup hooks, for structured-streaming sinks on executor 
> shutdown.
> -
>
> Key: SPARK-20737
> URL: https://issues.apache.org/jira/browse/SPARK-20737
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>  Labels: Kafka
>
> Add a standard way of cleanup during shutdown of executors for structured 
> streaming sinks in general and KafkaSink in particular.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Tues, June 6th, 2017 at 12:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc3
<https://github.com/apache/spark/tree/v2.2.0-rc3>
(cc5dbd55b0b312a661d21a4b605ce5ead2ba5218)

List of JIRA tickets resolved can be found with this filter
<https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0>.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1239/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc3-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.
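One lightweight way to do that (a sketch only; the project settings are placeholders, and only the resolver URL comes from this thread) is to point an sbt build at the staging repository listed above and rebuild an existing job against the RC artifacts:

```
// build.sbt -- smoke-test sketch against the 2.2.0 RC3 staging artifacts.
scalaVersion := "2.11.8"

resolvers += "Apache Spark 2.2.0 RC3 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1239/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
```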

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-06-02 Thread Michael Armbrust
This vote fails.  Following shortly with RC3

On Thu, Jun 1, 2017 at 8:28 PM, Reynold Xin <r...@databricks.com> wrote:

> Again (I've probably said this more than 10 times already in different
> threads), SPARK-18350 has no impact on whether the timestamp type is with
> timezone or without timezone. It simply allows a session specific timezone
> setting rather than having Spark always rely on the machine timezone.
>
> On Wed, May 31, 2017 at 11:58 AM, Kostas Sakellis <kos...@cloudera.com>
> wrote:
>
>> Hey Michael,
>>
>> There is a discussion on TIMESTAMP semantics going on the thread "SQL
>> TIMESTAMP semantics vs. SPARK-18350" which might impact Spark 2.2. Should
>> we make a decision there before voting on the next RC for Spark 2.2?
>>
>> Thanks,
>> Kostas
>>
>> On Tue, May 30, 2017 at 12:09 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Last call, anything else important in-flight for 2.2?
>>>
>>> On Thu, May 25, 2017 at 10:56 AM, Michael Allman <mich...@videoamp.com>
>>> wrote:
>>>
>>>> PR is here: https://github.com/apache/spark/pull/18112
>>>>
>>>>
>>>> On May 25, 2017, at 10:28 AM, Michael Allman <mich...@videoamp.com>
>>>> wrote:
>>>>
>>>> Michael,
>>>>
>>>> If you haven't started cutting the new RC, I'm working on a
>>>> documentation PR right now I'm hoping we can get into Spark 2.2 as a
>>>> migration note, even if it's just a mention: https://issues.apache
>>>> .org/jira/browse/SPARK-20888.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On May 22, 2017, at 11:39 AM, Michael Armbrust <mich...@databricks.com>
>>>> wrote:
>>>>
>>>> I'm waiting for SPARK-20814
>>>> <https://issues.apache.org/jira/browse/SPARK-20814> at Marcelo's
>>>> request and I'd also like to include SPARK-20844
>>>> <https://issues.apache.org/jira/browse/SPARK-20844>.  I think we
>>>> should be able to cut another RC midweek.
>>>>
>>>> On Fri, May 19, 2017 at 11:53 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> All the outstanding ML QA doc and user guide items are done for 2.2 so
>>>>> from that side we should be good to cut another RC :)
>>>>>
>>>>>
>>>>> On Thu, 18 May 2017 at 00:18 Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Seeing an issue with the DataScanExec and some of our integration
>>>>>> tests for the SCC. Running dataframe reads and writes from the shell seems
>>>>>> fine but the Redaction code seems to get a "None" when doing
>>>>>> SparkSession.getActiveSession.get in our integration tests. I'm not
>>>>>> sure why but I'll dig into this later if I get a chance.
>>>>>>
>>>>>> Example Failed Test
>>>>>> https://github.com/datastax/spark-cassandra-connector/blob/v
>>>>>> 2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/sp
>>>>>> ark/connector/sql/CassandraSQLSpec.scala#L311
>>>>>>
>>>>>> ```[info]   org.apache.spark.SparkException: Job aborted due to
>>>>>> stage failure: Task serialization failed: 
>>>>>> java.util.NoSuchElementException:
>>>>>> None.get
>>>>>> [info] java.util.NoSuchElementException: None.get
>>>>>> [info] at scala.None$.get(Option.scala:347)
>>>>>> [info] at scala.None$.get(Option.scala:345)
>>>>>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org
>>>>>> $apache$spark$sql$execution$DataSourceScanExec$$redact(DataSo
>>>>>> urceScanExec.scala:70)
>>>>>> [info] at org.apache.spark.sql.execution
>>>>>> .DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
>>>>>> [info] at org.apache.spark.sql.execution
>>>>>> .DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
>>>>>> ```
>>>>>>
>>>>>> Again this only seems to repro in our IT suite so I'm not sure if this
>>>>>> is a real issue.
>>>>>>
>>>>>>
>>>>>> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley <jos...@databricks.c

[jira] [Updated] (SPARK-20065) Empty output files created for aggregation query in append mode

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20065:
-
Target Version/s: 2.3.0

> Empty output files created for aggregation query in append mode
> ---
>
> Key: SPARK-20065
> URL: https://issues.apache.org/jira/browse/SPARK-20065
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Silvio Fiorito
>
> I've got a Kafka topic which I'm querying, running a windowed aggregation, 
> with a 30 second watermark, 10 second trigger, writing out to Parquet with 
> append output mode.
> Every 10 second trigger generates a file, regardless of whether there was any 
> data for that trigger, or whether any records were actually finalized by the 
> watermark.
> Is this expected behavior or should it not write out these empty files?
> {code}
> val df = spark.readStream.format("kafka")
> val query = df
>   .withWatermark("timestamp", "30 seconds")
>   .groupBy(window($"timestamp", "10 seconds"))
>   .count()
>   .select(date_format($"window.start", "HH:mm:ss").as("time"), $"count")
> query
>   .writeStream
>   .format("parquet")
>   .option("checkpointLocation", aggChk)
>   .trigger(ProcessingTime("10 seconds"))
>   .outputMode("append")
>   .start(aggPath)
> {code}
> As the query executes, do a file listing on "aggPath" and you'll see 339 byte 
> files at a minimum until we arrive at the first watermark and the initial 
> batch is finalized. Even after that though, as there are empty batches it'll 
> keep generating empty files every trigger.
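Not part of the original report, but a small sketch of how the empty files can be observed from a shell session (it assumes the same `spark` session and `aggPath` placeholder used in the snippet above):

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// List the parquet part files in the output directory with their sizes;
// the ~339 byte entries are the empty-batch files described above.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(aggPath))
  .filter(_.getPath.getName.endsWith(".parquet"))
  .foreach(s => println(s"${s.getPath.getName}: ${s.getLen} bytes"))
{code}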



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19903:
-
Target Version/s: 2.3.0

> Watermark metadata is lost when using resolved attributes
> -
>
> Key: SPARK-19903
> URL: https://issues.apache.org/jira/browse/SPARK-19903
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Ubuntu Linux
>Reporter: Piotr Nestorow
>
> PySpark example reads a Kafka stream. There is watermarking set when handling 
> the data window. The defined query uses output Append mode.
> The PySpark engine reports the error:
> 'Append output mode not supported when there are streaming aggregations on 
> streaming DataFrames/DataSets'
> The Python example:
> ---
> {code}
> import sys
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, split, window
> if __name__ == "__main__":
> if len(sys.argv) != 4:
> print("""
> Usage: structured_kafka_wordcount.py  
>  
> """, file=sys.stderr)
> exit(-1)
> bootstrapServers = sys.argv[1]
> subscribeType = sys.argv[2]
> topics = sys.argv[3]
> spark = SparkSession\
> .builder\
> .appName("StructuredKafkaWordCount")\
> .getOrCreate()
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", bootstrapServers)\
> .option(subscribeType, topics)\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array 
> into multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> The corresponding example in Zeppelin notebook:
> {code}
> %spark.pyspark
> from pyspark.sql.functions import explode, split, window
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "words")\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array into 
> multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> --
> Note that the Scala version of the same example in Zeppelin notebook works 
> fine:
> 
> import java.sql.Timestamp
> import org.apache.spark.sql.streaming.ProcessingTime
> import org.apache.spark.sql.functions._
> // Create DataSet representing the stream of input lines

[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19903:
-
Component/s: (was: PySpark)

> Watermark metadata is lost when using resolved attributes
> -
>
> Key: SPARK-19903
> URL: https://issues.apache.org/jira/browse/SPARK-19903
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Ubuntu Linux
>Reporter: Piotr Nestorow
>
> PySpark example reads a Kafka stream. There is watermarking set when handling 
> the data window. The defined query uses output Append mode.
> The PySpark engine reports the error:
> 'Append output mode not supported when there are streaming aggregations on 
> streaming DataFrames/DataSets'
> The Python example:
> ---
> {code}
> import sys
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, split, window
> if __name__ == "__main__":
> if len(sys.argv) != 4:
> print("""
> Usage: structured_kafka_wordcount.py  
>  
> """, file=sys.stderr)
> exit(-1)
> bootstrapServers = sys.argv[1]
> subscribeType = sys.argv[2]
> topics = sys.argv[3]
> spark = SparkSession\
> .builder\
> .appName("StructuredKafkaWordCount")\
> .getOrCreate()
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", bootstrapServers)\
> .option(subscribeType, topics)\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array 
> into multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> The corresponding example in Zeppelin notebook:
> {code}
> %spark.pyspark
> from pyspark.sql.functions import explode, split, window
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "words")\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array into 
> multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> --
> Note that the Scala version of the same example in Zeppelin notebook works 
> fine:
> 
> import java.sql.Timestamp
> import org.apache.spark.sql.streaming.ProcessingTime
> import org.apache.spark.sql.functions._
> // Create DataSet representing the stream of

[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19903:
-
Summary: Watermark metadata is lost when using resolved attributes  (was: 
PySpark Kafka streaming query output append mode not possible)

> Watermark metadata is lost when using resolved attributes
> -
>
> Key: SPARK-19903
> URL: https://issues.apache.org/jira/browse/SPARK-19903
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Ubuntu Linux
>Reporter: Piotr Nestorow
>
> PySpark example reads a Kafka stream. There is watermarking set when handling 
> the data window. The defined query uses output Append mode.
> The PySpark engine reports the error:
> 'Append output mode not supported when there are streaming aggregations on 
> streaming DataFrames/DataSets'
> The Python example:
> ---
> {code}
> import sys
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, split, window
> if __name__ == "__main__":
> if len(sys.argv) != 4:
> print("""
> Usage: structured_kafka_wordcount.py  
>  
> """, file=sys.stderr)
> exit(-1)
> bootstrapServers = sys.argv[1]
> subscribeType = sys.argv[2]
> topics = sys.argv[3]
> spark = SparkSession\
> .builder\
> .appName("StructuredKafkaWordCount")\
> .getOrCreate()
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", bootstrapServers)\
> .option(subscribeType, topics)\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array 
> into multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> The corresponding example in Zeppelin notebook:
> {code}
> %spark.pyspark
> from pyspark.sql.functions import explode, split, window
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "words")\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array into 
> multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> --
> Note that the Scala version of the same example in Zeppelin notebook works 
> fine:
> 
> import java.sql.Timestamp
> import org.apache.spark.sql.streaming.Processi

[jira] [Updated] (SPARK-19903) PySpark Kafka streaming query output append mode not possible

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19903:
-
Description: 
PySpark example reads a Kafka stream. There is watermarking set when handling 
the data window. The defined query uses output Append mode.

The PySpark engine reports the error:
'Append output mode not supported when there are streaming aggregations on 
streaming DataFrames/DataSets'

The Python example:
---
{code}
import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("""
        Usage: structured_kafka_wordcount.py <bootstrap-servers> <subscribe-type> <topics>
        """, file=sys.stderr)
        exit(-1)

    bootstrapServers = sys.argv[1]
    subscribeType = sys.argv[2]
    topics = sys.argv[3]

    spark = SparkSession\
        .builder\
        .appName("StructuredKafkaWordCount")\
        .getOrCreate()

    # Create DataSet representing the stream of input lines from kafka
    lines = spark\
        .readStream\
        .format("kafka")\
        .option("kafka.bootstrap.servers", bootstrapServers)\
        .option(subscribeType, topics)\
        .load()\
        .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")

    # Split the lines into words, retaining timestamps
    # split() splits each line into an array, and explode() turns the array
    # into multiple rows
    words = lines.select(
        explode(split(lines.value, ' ')).alias('word'),
        lines.timestamp
    )

    # Group the data by window and word and compute the count of each group
    windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
        window(words.timestamp, "30 seconds", "30 seconds"), words.word
    ).count()

    # Start running the query that prints the running counts to the console
    query = windowedCounts\
        .writeStream\
        .outputMode('append')\
        .format('console')\
        .option("truncate", "false")\
        .start()

    query.awaitTermination()
{code}

The corresponding example in Zeppelin notebook:
{code}
%spark.pyspark

from pyspark.sql.functions import explode, split, window

# Create DataSet representing the stream of input lines from kafka
lines = spark\
.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "words")\
.load()\
.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")

# Split the lines into words, retaining timestamps
# split() splits each line into an array, and explode() turns the array into
# multiple rows
words = lines.select(
explode(split(lines.value, ' ')).alias('word'),
lines.timestamp
)

# Group the data by window and word and compute the count of each group
windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
window(words.timestamp, "30 seconds", "30 seconds"), words.word
).count()

# Start running the query that prints the running counts to the console
query = windowedCounts\
.writeStream\
.outputMode('append')\
.format('console')\
.option("truncate", "false")\
.start()

query.awaitTermination()
--

Note that the Scala version of the same example in Zeppelin notebook works fine:

import java.sql.Timestamp
import org.apache.spark.sql.streaming.ProcessingTime
import org.apache.spark.sql.functions._

// Create DataSet representing the stream of input lines from kafka
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "words")
.load()

// Split the lines into words, retaining timestamps
val words = lines
.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
.as[(String, Timestamp)]
.flatMap(line => line._1.split(" ").map(word => (word, line._2)))
.toDF("word", "timestamp")

// Group the data by window and word and compute the count of each group
val windowedCounts = words
.withWatermark("timestamp", "30 seconds")
.groupBy(window($"timestamp", "30 seconds", "30 seconds"), $"word")

[jira] [Commented] (SPARK-20002) Add support for unions between streaming and batch datasets

2017-06-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035441#comment-16035441
 ] 

Michael Armbrust commented on SPARK-20002:
--

I'm not sure that we will ever support this.  The issue is that for batch 
datasets, we don't track what has been read.  Thus it's unclear what should 
happen when the query is restarted.  Instead, I think you can always achieve 
the same result by just loading both datasets as a stream (even if you don't 
plan to change one of them).  Would that work?
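A rough sketch of that suggestion (the schema, path, topic name, and the `value` column below are placeholders; it assumes the "batch" side lives in a directory a file source can read):

```
// Read the "real" stream from Kafka.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// Read the dataset you would otherwise load as a batch with readStream too;
// file sources require an explicit schema.
val fileStream = spark.readStream
  .schema(referenceSchema)
  .parquet("/data/reference")
  .selectExpr("CAST(value AS STRING) AS value")

// Now both sides are streaming, so the union is supported.
val unioned = kafkaStream.union(fileStream)
```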

> Add support for unions between streaming and batch datasets
> ---
>
> Key: SPARK-20002
> URL: https://issues.apache.org/jira/browse/SPARK-20002
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Leon Pham
>
> Currently unions between streaming datasets and batch datasets are not 
> supported.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20147) Cloning SessionState does not clone streaming query listeners

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-20147.
--
  Resolution: Fixed
Assignee: Kunal Khamar
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

Fixed by https://github.com/apache/spark/pull/17379

> Cloning SessionState does not clone streaming query listeners
> -
>
> Key: SPARK-20147
> URL: https://issues.apache.org/jira/browse/SPARK-20147
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>Assignee: Kunal Khamar
> Fix For: 2.2.0
>
>
> Cloning session should clone StreamingQueryListeners registered on the 
> StreamingQueryListenerBus.
> Similar to SPARK-20048, https://github.com/apache/spark/pull/17379



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20928:
-
Description: 
Given the current Source API, the minimum possible latency for any record is 
bounded by the amount of time that it takes to launch a task.  This limitation 
is a result of the fact that {{getBatch}} requires us to know both the starting 
and the ending offset, before any tasks are launched.  In the worst case, the 
end-to-end latency is actually closer to the average batch time + task 
launching time.

For applications where latency is more important than exactly-once output 
however, it would be useful if processing could happen continuously.  This 
would allow us to achieve fully pipelined reading and writing from sources such 
as Kafka.  This kind of architecture would make it possible to process records 
with end-to-end latencies on the order of 1 ms, rather than the 10-100ms that 
is possible today.

One possible architecture here would be to change the Source API to look like 
the following rough sketch:

{code}
  trait Epoch {
def data: DataFrame

/** The exclusive starting position for `data`. */
def startOffset: Offset

/** The inclusive ending position for `data`.  Incrementally updated during 
processing, but not complete until execution of the query plan in `data` is 
finished. */
def endOffset: Offset
  }

  def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], limits: 
Limits): Epoch
{code}

The above would allow us to build an alternative implementation of 
{{StreamExecution}} that processes continuously with much lower latency and 
only stops processing when needing to reconfigure the stream (either due to a 
failure or a user-requested change in parallelism).

> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20734) Structured Streaming spark.sql.streaming.schemaInference not handling schema changes

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20734:
-
Issue Type: New Feature  (was: Bug)

> Structured Streaming spark.sql.streaming.schemaInference not handling schema 
> changes
> 
>
> Key: SPARK-20734
> URL: https://issues.apache.org/jira/browse/SPARK-20734
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Ram
>
> sparkSession.config("spark.sql.streaming.schemaInference", 
> true).getOrCreate();
> Dataset<Row> dataset = 
> sparkSession.readStream().parquet("file:/files-to-process");
> StreamingQuery streamingQuery =
> dataset.writeStream().option("checkpointLocation", 
> "file:/checkpoint-location")
> .outputMode(Append()).start("file:/save-parquet-files");
> streamingQuery.awaitTermination();
> After the streaming query has started, if there are schema changes in new parquet 
> files under the files-to-process directory, Structured Streaming does not write the 
> new schema changes. Is it possible to handle these schema changes in Structured 
> Streaming?
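Not from the reporter, but one commonly suggested mitigation is sketched below: declare an explicit schema that is a superset of the old and new files instead of relying on inference (field names are placeholders, and a Scala `spark` session is assumed):

{code}
import org.apache.spark.sql.types._

val evolvedSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StringType),
  StructField("added_later", StringType)   // column that only newer files carry
))

val dataset = spark.readStream
  .schema(evolvedSchema)
  .parquet("file:/files-to-process")
{code}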



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20958:
-
Labels: release-notes  (was: )

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-20958.
--
Resolution: Won't Fix

Thanks everyone.  Sounds like we'll just provide directions in the release 
notes for users of parquet-avro to pin the version 1.8.1.
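For sbt users, such a note might look roughly like the following sketch (the coordinates are the standard parquet-avro ones; the exact wording of the release note is still to be written):

```
// Pin parquet-avro explicitly so it stays compatible with the avro 1.7.x
// line that spark-core 2.2.0 depends on.
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1"

// Or, if parquet-avro only arrives transitively, force the version instead.
dependencyOverrides += "org.apache.parquet" % "parquet-avro" % "1.8.1"
```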

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035012#comment-16035012
 ] 

Michael Armbrust commented on SPARK-19104:
--

I'm about to cut RC3 of 2.2 and there is no pull request to fix this.  
Unfortunately that means it's not going to be fixed in 2.2.0

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codeh

[jira] [Updated] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15693:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Write schema definition out for file-based data sources to avoid schema 
> inference
> -
>
> Key: SPARK-15693
> URL: https://issues.apache.org/jira/browse/SPARK-15693
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark supports reading a variety of data formats, many of which don't have 
> a self-describing schema. For these file formats, Spark can often infer the 
> schema by going through all the data. However, schema inference is expensive 
> and does not always infer the intended schema (for example, with JSON data 
> Spark always infers integer types as long, rather than int).
> It would be great if Spark can write the schema definition out for file-based 
> formats, and when reading the data in, schema can be "inferred" directly by 
> reading the schema definition file without going through full schema 
> inference. If the file does not exist, then the good old schema inference 
> should be performed.
> This ticket certainly merits a design doc that should discuss the spec for 
> schema definition, as well as all the corner cases that this feature needs to 
> handle (e.g. schema merging, schema evolution, partitioning). It would be 
> great if the schema definition is using a human readable format (e.g. JSON).
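A manual version of this idea can already be sketched with public APIs (paths are placeholders): write the inferred schema out as JSON once and reuse it on later reads to skip inference.

{code}
import org.apache.spark.sql.types.{DataType, StructType}

// First read: pay the inference cost once, then persist the schema definition.
val inferred = spark.read.json("/data/events")
spark.sparkContext.parallelize(Seq(inferred.schema.json), 1)
  .saveAsTextFile("/data/events_schema")

// Later reads: parse the stored definition instead of re-inferring it.
val storedJson = spark.sparkContext.textFile("/data/events_schema").first()
val schema = DataType.fromJson(storedJson).asInstanceOf[StructType]
val events = spark.read.schema(schema).json("/data/events")
{code}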



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15380) Generate code that stores a float/double value in each column from ColumnarBatch when DataFrame.cache() is used

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15380:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Generate code that stores a float/double value in each column from 
> ColumnarBatch when DataFrame.cache() is used
> ---
>
> Key: SPARK-15380
> URL: https://issues.apache.org/jira/browse/SPARK-15380
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When DataFrame.cache() is called, data will be stored as column-oriented 
> storage in CachedBatch. The current Catalyst generates a Java program that stores 
> a computed value into an InternalRow, and then the value is stored into 
> CachedBatch even if the data was read from a ColumnarBatch by the Parquet reader. 
> This JIRA generates Java code to store a value into a ColumnarBatch, and to 
> store data from the ColumnarBatch to the CachedBatch. This JIRA handles only 
> float and double types for a value.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19084) conditional function: field

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19084:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> conditional function: field
> ---
>
> Key: SPARK-19084
> URL: https://issues.apache.org/jira/browse/SPARK-19084
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Chenzhao Guo
>
> field(str, str1, str2, ...) is a variable-length (>= 2 arguments) function which returns 
> the 1-based index of str in the list (str1, str2, ...), or 0 if not found.
> Every parameter is required to be a subtype of AtomicType.
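Not the proposed built-in, but the described semantics can be sketched today as a UDF (the DataFrame `df`, the `status` column, and the candidate values below are made up for illustration):

{code}
import org.apache.spark.sql.functions.{array, col, lit, udf}

// Returns the 1-based position of `str` among the candidates, or 0 if absent.
val fieldUdf = udf { (str: String, candidates: Seq[String]) =>
  val i = candidates.indexOf(str)
  if (i < 0) 0 else i + 1
}

val indexed = df.select(
  fieldUdf(col("status"), array(lit("new"), lit("open"), lit("closed"))).as("field")
)
{code}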



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15691) Refactor and improve Hive support

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15691:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog (So, for a CatalogTable returned by HiveExternalCatalog, 
> we do not need to distinguish tables stored in hive formats and data source 
> tables).
>   -- Potentially more.
> - Remove Hive's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14878) Support Trim characters in the string trim function

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14878:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Support Trim characters in the string trim function
> ---
>
> Key: SPARK-14878
> URL: https://issues.apache.org/jira/browse/SPARK-14878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kevin yu
>
> The current Spark SQL does not support the trim characters in the string trim 
> function, which is part of ANSI SQL2003’s standard. For example, IBM DB2 
> fully supports it as shown in the 
> https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html.
>  We propose to implement it in this JIRA.
> The ANSI SQL2003's trim Syntax:
> {noformat}
> SQL
> <trim function> ::= TRIM <left paren> <trim operands> <right paren>
> <trim operands> ::= [ [ <trim specification> ] [ <trim character> ] FROM ] <trim source>
> <trim source> ::= <character value expression>
> <trim specification> ::=
>   LEADING
> | TRAILING
> | BOTH
> <trim character> ::= <character value expression>
> {noformat}
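Until that syntax is available, the BOTH/LEADING/TRAILING variants can be approximated with regexp_replace; a sketch for an assumed DataFrame `df` with a string column `account_id` and trim character '0':

{code}
import org.apache.spark.sql.functions.{col, regexp_replace}

// TRIM(BOTH '0' FROM account_id): strip the character from both ends only.
val trimmed = df.select(
  regexp_replace(col("account_id"), "^0+|0+$", "").as("account_id")
)

// LEADING would use the pattern "^0+" and TRAILING the pattern "0+$".
{code}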



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16496) Add wholetext as option for reading text in SQL.

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-16496:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Add wholetext as option for reading text in SQL.
> 
>
> Key: SPARK-16496
> URL: https://issues.apache.org/jira/browse/SPARK-16496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prashant Sharma
>
> In many text analysis problems, it is often undesirable for the rows to 
> be split by "\n". There exists a wholeText reader for the RDD API, and this JIRA 
> just adds the same support to the Dataset API.
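For context, a sketch of the existing RDD-level behaviour this asks to expose through the Dataset/SQL reader (the path is a placeholder):

{code}
import spark.implicits._

// One (path, fullContents) pair per file, i.e. rows are not split on "\n".
val perFile = spark.sparkContext
  .wholeTextFiles("/data/docs")
  .toDF("path", "content")
{code}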



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19241) remove hive generated table properties if they are not useful in Spark

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19241:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> remove hive generated table properties if they are not useful in Spark
> --
>
> Key: SPARK-19241
> URL: https://issues.apache.org/jira/browse/SPARK-19241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> When we save a table into the Hive metastore, Hive will generate some table 
> properties automatically, e.g. transient_lastDdlTime, last_modified_by, 
> rawDataSize, etc. Some of them are useless in Spark SQL, so we should remove 
> them.
> It would be good if we could get the list of Hive-generated table properties via 
> the Hive API, so that we don't need to hardcode them.
> We can take a look at the Hive code to see how it excludes these auto-generated 
> table properties when describing a table.
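For illustration only, the hardcoded variant this issue wants to improve on might look like the sketch below (the property names are the ones mentioned above; the helper itself is hypothetical):

{code}
// Known Hive-generated keys; extend as more auto-generated properties are found.
val hiveGeneratedProperties = Set("transient_lastDdlTime", "last_modified_by", "rawDataSize")

def stripHiveGenerated(props: Map[String, String]): Map[String, String] =
  props.filterKeys(k => !hiveGeneratedProperties.contains(k)).toMap
{code}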



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16317) Add file filtering interface for FileFormat

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-16317:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Add file filtering interface for FileFormat
> ---
>
> Key: SPARK-16317
> URL: https://issues.apache.org/jira/browse/SPARK-16317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Minor
>
> {{FileFormat}} data sources like Parquet and Avro (provided by spark-avro) 
> have customized file filtering logic. For example, Parquet needs to filter 
> out summary files, while Avro provides a Hadoop configuration option to 
> filter out all files whose names don't end with ".avro".
> It would be nice to have a general file filtering interface in {{FileFormat}} 
> to handle similar requirements.
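For reference, the Avro option mentioned above is a plain Hadoop configuration flag; a sketch of how it is typically set today (the key is the one documented by spark-avro, so treat it as an assumption if your version differs):

{code}
// Ignore input files whose names do not end with ".avro".
spark.sparkContext.hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "true")
{code}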



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19027) estimate size of object buffer for object hash aggregate

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19027:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> estimate size of object buffer for object hash aggregate
> 
>
> Key: SPARK-19027
> URL: https://issues.apache.org/jira/browse/SPARK-19027
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19104:
-
Target Version/s: 2.3.0  (was: 2.2.0)

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.co

[jira] [Updated] (SPARK-18245) Improving support for bucketed table

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18245:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Improving support for bucketed table
> 
>
> Key: SPARK-18245
> URL: https://issues.apache.org/jira/browse/SPARK-18245
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket for improving various execution planning for 
> bucketed tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14098:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Generate Java code to build CachedColumnarBatch and get values from 
> CachedColumnarBatch when DataFrame.cache() is called
> 
>
> Key: SPARK-14098
> URL: https://issues.apache.org/jira/browse/SPARK-14098
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing]
>  is a design document for this change (***TODO: Update the document***).
> This JIRA implements a new in-memory cache feature used by DataFrame.cache 
> and Dataset.cache. The following is the basic design, based on discussions with 
> Sameer, Weichen, Xiao, Herman, and Nong.
> * Use ColumnarBatch with ColumnVector, which are common data representations 
> for columnar storage
> * Use multiple compression schemes (such as RLE, int-delta, and so on) for each 
> ColumnVector in ColumnarBatch, depending on its data type
> * Generate code that is simple and specialized for each in-memory cache to 
> build an in-memory cache
> * Generate code that directly reads data from ColumnVector for the in-memory 
> cache by whole-stage codegen.
> * Enhance ColumnVector to keep UnsafeArrayData
> * Use primitive-type array for primitive uncompressed data type in 
> ColumnVector
> * Use byte[] for UnsafeArrayData and compressed data
> Based on this design, this JIRA generates two kinds of Java code for 
> DataFrame.cache()/Dataset.cache()
> * Generate Java code to build CachedColumnarBatch, which keeps data in 
> ColumnarBatch
> * Generate Java code to get a value of each column from ColumnarBatch
> ** a Get a value directly from ColumnarBatch in code generated by whole-stage 
> code gen (primary path)
> ** b Get a value through an iterator if whole-stage code gen is disabled (e.g. 
> the # of columns is more than 100, as a backup path)
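
Independently of the generated code described above, the user-facing entry point is simply Dataset.cache(). A minimal sketch of exercising the in-memory columnar cache, assuming a SparkSession named {{spark}} and the existing inMemoryColumnarStorage confs:

{code}
// Toggle the existing in-memory columnar storage options, then cache and
// read back through the columnar code path.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

val df = spark.range(0, 1000000).selectExpr("id", "id % 7 AS key")
df.cache()                          // the columnar cache is built on the first action
df.groupBy("key").count().show()    // values are read back from the cached batches
{code}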



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19014) support complex aggregate buffer in HashAggregateExec

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19014:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> support complex aggregate buffer in HashAggregateExec
> -
>
> Key: SPARK-19014
> URL: https://issues.apache.org/jira/browse/SPARK-19014
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16011) SQL metrics include duplicated attempts

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-16011:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> SQL metrics include duplicated attempts
> ---
>
> Key: SPARK-16011
> URL: https://issues.apache.org/jira/browse/SPARK-16011
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>
> When I ran a simple scan and aggregate query, the number of rows reported for 
> the scan could differ from run to run. The actual scanned result is correct, 
> but the SQL metrics are wrong (they should not include duplicated attempts); 
> this is a regression since 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18388) Running aggregation on many columns throws SOE

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18388:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Running aggregation on many columns throws SOE
> --
>
> Key: SPARK-18388
> URL: https://issues.apache.org/jira/browse/SPARK-18388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
> Environment: PySpark 2.0.1, Jupyter
>Reporter: Raviteja Lokineni
> Attachments: spark-bug.csv, spark-bug-jupyter.py, 
> spark-bug-stacktrace.txt
>
>
> Usecase: I am generating weekly aggregates of every column of data
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import *
> timeSeries = sqlContext.read.option("header", 
> "true").format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("file:///tmp/spark-bug.csv")
> # Hive timestamp is interpreted as UNIX timestamp in seconds*
> days = lambda i: i * 86400
> w = (Window()
>  .partitionBy("id")
>  .orderBy(col("dt").cast("timestamp").cast("long"))
>  .rangeBetween(-days(6), 0))
> cols = ["id", "dt"]
> skipCols = ["id", "dt"]
> for col in timeSeries.columns:
> if col in skipCols:
> continue
> cols.append(mean(col).over(w).alias("mean_7_"+col))
> cols.append(count(col).over(w).alias("count_7_"+col))
> cols.append(sum(col).over(w).alias("sum_7_"+col))
> cols.append(min(col).over(w).alias("min_7_"+col))
> cols.append(max(col).over(w).alias("max_7_"+col))
> df = timeSeries.select(cols)
> df.orderBy('id', 
> 'dt').write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").save("file:///tmp/spark-bug-out.csv")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19989:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite
> 
>
> Key: SPARK-19989
> URL: https://issues.apache.org/jira/browse/SPARK-19989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here: 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/stress_test_with_multiple_topics_and_partitions/
> And based on Josh's dashboard 
> (https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressSuite_name=stress+test+with+multiple+topics+and+partitions),
>  it seems to fail a few times every month.  Here's the full error from the most 
> recent failure:
> Error Message
> {code}
> org.scalatest.exceptions.TestFailedException:  Error adding data: replication 
> factor: 1 larger than available brokers: 0 
> kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)  
> kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)  
> org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
>   scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)  
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)
> {code}
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Error adding data: replication factor: 1 larger than available brokers: 0
> kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
>   kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
>   
> org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
>   scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)
> == Progress ==
>AssertOnQuery(, )
>CheckAnswer: 
>StopStream
>
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@5d888be0,Map())
>AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range(0, 1, 2, 3, 4, 5, 6, 7, 8), message = )
>CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9]
>StopStream
>
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1be724ee,Map())
>AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range(9, 10, 11, 12, 13, 14), message = )
>CheckAnswer: 
> [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15]
>StopStream
>AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range(), message = )
> => AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range(15), message = Add topic stress7)
>AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range(16, 17, 18, 19, 20, 21, 22), message = Add partitio

[jira] [Updated] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17915:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Prepare ColumnVector implementation for UnsafeData
> --
>
> Key: SPARK-17915
> URL: https://issues.apache.org/jira/browse/SPARK-17915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Current implementations of {{ColumnarVector}} are {{OnHeapColumnarVector}} 
> and {{OffHeapColumnarVector}}, which are optimized for reading data from 
> Parquet. If they get an array, a map, or a struct from an {{Unsafe}}-related 
> data structure, it is inefficient.
> This JIRA prepares a new implementation, {{OnHeapUnsafeColumnarVector}}, that 
> is optimized for reading data from an {{Unsafe}}-related data structure.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18134) SQL: MapType in Group BY and Joins not working

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18134:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> SQL: MapType in Group BY and Joins not working
> --
>
> Key: SPARK-18134
> URL: https://issues.apache.org/jira/browse/SPARK-18134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1, 
> 2.1.0
>Reporter: Christian Zorneck
>
> Since version 1.5 and issue SPARK-9415, MapTypes can no longer be used in 
> GROUP BY and join clauses. This is incompatible with HiveQL: a Hive feature 
> was removed from Spark, making Spark incompatible with various HiveQL 
> statements.
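
A commonly used workaround, sketched below, is to explode the MapType column into (key, value) rows, which are groupable types; the data and names here are invented for illustration and are not part of the ticket itself:

{code}
import org.apache.spark.sql.functions._
import spark.implicits._   // assumes a SparkSession named spark (e.g. spark-shell)

val df = Seq(
  (1, Map("a" -> 1, "b" -> 2)),
  (2, Map("a" -> 1, "b" -> 2))
).toDF("id", "props")

// df.groupBy("props") fails because MapType is not a groupable type.
// Exploding the map yields "key" and "value" columns, which can be grouped.
val exploded = df.select(col("id"), explode(col("props")))
exploded.groupBy("key", "value").count().show()
{code}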



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18455) General support for correlated subquery processing

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18455:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
> Attachments: SPARK-18455-scoping-doc.pdf
>
>
> Subquery support was introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use cases: the ones used in TPC queries, for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or Exists type of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator. In that case we only support equality predicates.
> However this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregates and/or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing.
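
To make the first gap concrete, here is a sketch of the boundary described above; the table and column names are invented, and the second statement is the kind of query this ticket aims to support:

{code}
// Supported today: a correlated predicate subquery in the WHERE clause.
spark.sql("""
  SELECT o.id, o.amount
  FROM orders o
  WHERE EXISTS (SELECT 1 FROM returns r WHERE r.order_id = o.id)
""")

// Not yet supported: the same predicate subquery used in the projection.
spark.sql("""
  SELECT o.id,
         EXISTS (SELECT 1 FROM returns r WHERE r.order_id = o.id) AS was_returned
  FROM orders o
""")
{code}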



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15690:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by its 
> partition id and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When operating on a single node against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fall back to disk 
> if the data does not fit in memory. Given that the number of partitions is 
> usually small (say less than 256), it would require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15689) Data source API v2

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15689:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, making data source API compatibility depend on the 
> upper-level API. The current data source API is also row-oriented only and has 
> to go through an expensive conversion from external data types to internal 
> data types.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13184) Support minPartitions parameter for JSON and CSV datasources as options

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13184:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Support minPartitions parameter for JSON and CSV datasources as options
> ---
>
> Key: SPARK-13184
> URL: https://issues.apache.org/jira/browse/SPARK-13184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> After looking through the pull requests below at Spark CSV datasources,
> https://github.com/databricks/spark-csv/pull/256
> https://github.com/databricks/spark-csv/issues/141
> https://github.com/databricks/spark-csv/pull/186
> It looks like Spark might need to be able to set {{minPartitions}}.
> {{repartition()}} or {{coalesce()}} can be alternatives, but it looks like they 
> need to shuffle the data in most cases.
> Although I am still not sure whether this is needed, I will open this ticket 
> just for discussion.
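
For reference, the alternatives mentioned above look like this today; the input path is invented and a SparkSession named {{spark}} is assumed:

{code}
val df = spark.read
  .option("header", "true")
  .csv("/data/input.csv")

// repartition() reshuffles the data to reach the requested parallelism,
// while coalesce() only merges existing partitions and avoids a full shuffle.
val wide = df.repartition(64)
val narrow = df.coalesce(4)
{code}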



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13682) Finalize the public API for FileFormat

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13682:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Finalize the public API for FileFormat
> --
>
> Key: SPARK-13682
> URL: https://issues.apache.org/jira/browse/SPARK-13682
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>    Reporter: Michael Armbrust
>
> The current file format interface needs to be cleaned up before it is 
> acceptable for public consumption:
>  - Have a version that takes Row and does a conversion, hide the internal API.
>  - Remove bucketing
>  - Remove RDD and the broadcastedConf
>  - Remove SQLContext (maybe include SparkSession?)
>  - Pass a better conf object



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9221) Support IntervalType in Range Frame

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9221:

Target Version/s: 2.3.0  (was: 2.2.0)

> Support IntervalType in Range Frame
> ---
>
> Key: SPARK-9221
> URL: https://issues.apache.org/jira/browse/SPARK-9221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> Support the IntervalType in window range frames, as mentioned in the 
> conclusion of the Databricks blog 
> [post|https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html]
>  on window functions.
> This actually requires us to support Literals instead of Integer constants in 
> Range Frames. The following things will have to be modified:
> * org.apache.spark.sql.hive.HiveQl
> * org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame
> * org.apache.spark.sql.execution.Window
> * org.apache.spark.sql.expressions.Window
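
Until interval literals are supported in range frames, the usual workaround is to order by the timestamp cast to epoch seconds and express the range in seconds. A sketch, where the {{events}} DataFrame and its columns are invented:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// A 7-day trailing window expressed in seconds, because rangeBetween only
// accepts numeric offsets today.
val w = Window
  .partitionBy("user_id")
  .orderBy(col("ts").cast("timestamp").cast("long"))
  .rangeBetween(-6 * 86400L, 0)

val weekly = events.withColumn("amount_7d", sum("amount").over(w))
{code}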



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20319) Already quoted identifiers are getting wrapped with additional quotes

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20319:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Already quoted identifiers are getting wrapped with additional quotes
> -
>
> Key: SPARK-20319
> URL: https://issues.apache.org/jira/browse/SPARK-20319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Umesh Chaudhary
>
> The issue was caused by 
> [SPARK-16387|https://issues.apache.org/jira/browse/SPARK-16387], where 
> reserved SQL words are honored by wrapping column names in quotes. 
> In our test we found that when column names are already explicitly quoted, 
> the Oracle JDBC driver throws: 
> java.sql.BatchUpdateException: ORA-01741: illegal zero-length identifier 
> at 
> oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:12296)
>  
> at 
> oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:246)
>  
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
>  
> and the Cassandra JDBC driver throws: 
> 17/04/12 19:03:48 ERROR executor.Executor: Exception in task 0.0 in stage 5.0 
> (TID 6)
> java.sql.SQLSyntaxErrorException: [FMWGEN][Cassandra JDBC 
> Driver][Cassandra]syntax error or access rule violation: base table or view 
> not found: 
>   at weblogic.jdbc.cassandrabase.ddcl.b(Unknown Source)
>   at weblogic.jdbc.cassandrabase.ddt.a(Unknown Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:118)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:571)
> CC: [~rxin] , [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9576) DataFrame API improvement umbrella ticket (in Spark 2.x)

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9576:

Target Version/s: 2.3.0  (was: 2.2.0)

> DataFrame API improvement umbrella ticket (in Spark 2.x)
> 
>
> Key: SPARK-9576
> URL: https://issues.apache.org/jira/browse/SPARK-9576
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18394) Executing the same query twice in a row results in CodeGenerator cache misses

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18394:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Executing the same query twice in a row results in CodeGenerator cache misses
> -
>
> Key: SPARK-18394
> URL: https://issues.apache.org/jira/browse/SPARK-18394
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: HiveThriftServer2 running on branch-2.0 on Mac laptop
>Reporter: Jonny Serencsa
>
> Executing the query:
> {noformat}
> select
> l_returnflag,
> l_linestatus,
> sum(l_quantity) as sum_qty,
> sum(l_extendedprice) as sum_base_price,
> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
> from
> lineitem_1_row
> where
> l_shipdate <= date_sub('1998-12-01', '90')
> group by
> l_returnflag,
> l_linestatus
> ;
> {noformat}
> twice (in succession) will result in CodeGenerator cache misses in BOTH 
> executions. Since the query is identical, I would expect the same code to be 
> generated. 
> It turns out the generated code is not exactly the same, resulting in cache 
> misses when performing the lookup in the CodeGenerator cache. Yet, the code 
> is equivalent. 
> Below is (some portion of the) generated code for two runs of the query:
> run-1
> {noformat}
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import scala.collection.Iterator;
> import org.apache.spark.sql.types.DataType;
> import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
> import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;
> import org.apache.spark.sql.execution.columnar.MutableUnsafeRow;
> public SpecificColumnarIterator generate(Object[] references) {
> return new SpecificColumnarIterator();
> }
> class SpecificColumnarIterator extends 
> org.apache.spark.sql.execution.columnar.ColumnarIterator {
> private ByteOrder nativeOrder = null;
> private byte[][] buffers = null;
> private UnsafeRow unsafeRow = new UnsafeRow(7);
> private BufferHolder bufferHolder = new BufferHolder(unsafeRow);
> private UnsafeRowWriter rowWriter = new UnsafeRowWriter(bufferHolder, 7);
> private MutableUnsafeRow mutableRow = null;
> private int currentRow = 0;
> private int numRowsInBatch = 0;
> private scala.collection.Iterator input = null;
> private DataType[] columnTypes = null;
> private int[] columnIndexes = null;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor accessor;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor1;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor2;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor3;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor4;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor5;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor6;
> public SpecificColumnarIterator() {
> this.nativeOrder = ByteOrder.nativeOrder();
> this.buffers = new byte[7][];
> this.mutableRow = new MutableUnsafeRow(rowWriter);
> }
> public void initialize(Iterator input, DataType[] columnTypes, int[] 
> columnIndexes) {
> this.input = input;
> this.columnTypes = columnTypes;
> this.columnIndexes = columnIndexes;
> }
> public boolean hasNext() {
> if (currentRow < numRowsInBatch) {
> return true;
> }
> if (!input.hasNext()) {
> return false;
> }
> org.apache.spark.sql.execution.columnar.CachedBatch batch = 
> (org.apache.spark.sql.execution.columnar.CachedBatch) input.next();
> currentRow = 0;
> numRowsInBatch = batch.numRows();
> for (int i = 0; i < columnIndexes.length; i ++) {
> buffers[i] = batch.buffers()[columnIndexes[i]];
> }
> accessor = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[0]).order(nativeOrder));
> accessor1 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[1]).order(nativeOrder));
> accessor2 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[2]).order(nativeOrder));
> accessor3 = new 
> org.apache.spark.sql.execu

[jira] [Updated] (SPARK-18891) Support for specific collection types

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18891:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Support for specific collection types
> -
>
> Key: SPARK-18891
> URL: https://issues.apache.org/jira/browse/SPARK-18891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}), which 
> forces users to define classes only with the most generic type.
> An [example 
> error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]:
> {code}
> case class SpecificCollection(aList: List[Int])
> Seq(SpecificCollection(1 :: Nil)).toDS().collect()
> {code}
> {code}
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 98, Column 120: No applicable constructor/method found 
> for actual parameters "scala.collection.Seq"; candidates are: 
> "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)"
> {code}
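
The practical workaround today, sketched below, is to declare the field with the generic collection type the encoder produces (Seq rather than List); this is illustrative only and assumes a spark-shell session with implicits imported:

{code}
import spark.implicits._

// Declaring the field as Seq[Int] instead of List[Int] side-steps the
// decoding error, while a List can still be passed in (List <: Seq).
case class GenericCollection(aList: Seq[Int])

Seq(GenericCollection(List(1, 2, 3))).toDS().collect()
{code}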



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14543) SQL/Hive insertInto has unexpected results

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14543:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> SQL/Hive insertInto has unexpected results
> --
>
> Key: SPARK-14543
> URL: https://issues.apache.org/jira/browse/SPARK-14543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>
> *Updated description*
> There should be an option to match input data to output columns by name. The 
> API allows operations on tables, which hide the column resolution problem. 
> It's easy to copy from one table to another without listing the columns, and 
> in the API it is common to work with columns by name rather than by position. 
> I think the API should add a way to match columns by name, which is closer to 
> what users expect. I propose adding something like this:
> {code}
> CREATE TABLE src (id: bigint, count: int, total: bigint)
> CREATE TABLE dst (id: bigint, total: bigint, count: int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
> *Original description*
> The Hive write path adds a pre-insertion cast (projection) to reconcile 
> incoming data columns with the outgoing table schema. Columns are matched by 
> position and casts are inserted to reconcile the two column schemas.
> When columns aren't correctly aligned, this causes unexpected results. I ran 
> into this by not using a correct {{partitionBy}} call (addressed by 
> SPARK-14459), which caused an error message that an int could not be cast to 
> an array. However, if the columns are vaguely compatible, for example string 
> and float, then no error or warning is produced and data is written to the 
> wrong columns using unexpected casts (string -> bigint -> float).
> A real-world use case that will hit this is when a table definition changes 
> by adding a column in the middle of a table. Spark SQL statements that copied 
> from that table to a destination table will then map the columns differently 
> but insert casts that mask the problem. The last column's data will be 
> dropped without a reliable warning for the user.
> This highlights a few problems:
> * Too many or too few incoming data columns should cause an AnalysisException 
> to be thrown
> * Only "safe" casts should be inserted automatically, like int -> long, using 
> UpCast
> * Pre-insertion casts currently ignore extra columns by using zip
> * The pre-insertion cast logic differs between Hive's MetastoreRelation and 
> LogicalRelation
> Also, I think there should be an option to match input data to output columns 
> by name. The API allows operations on tables, which hide the column 
> resolution problem. It's easy to copy from one table to another without 
> listing the columns, and in the API it is common to work with columns by name 
> rather than by position. I think the API should add a way to match columns by 
> name, which is closer to what users expect. I propose adding something like 
> this:
> {code}
> CREATE TABLE src (id: bigint, count: int, total: bigint)
> CREATE TABLE dst (id: bigint, total: bigint, count: int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
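
Until something like {{byName}} exists, one workaround is to reorder the source columns explicitly to match the destination schema before calling insertInto; a sketch against the {{src}}/{{dst}} tables above, assuming a SparkSession named {{spark}}:

{code}
import org.apache.spark.sql.functions.col

// Read the destination column order from the catalog and project the source
// into that order, so positional matching lines up with the intended columns.
val dstCols = spark.table("dst").columns.map(col)   // id, total, count
spark.table("src")
  .select(dstCols: _*)
  .write
  .insertInto("dst")
{code}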



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17556:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.
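
For context, this is how a broadcast join is requested today; the build side is collected to the driver and broadcast from there, which is the latency this ticket targets ({{largeDF}} and {{smallDF}} are invented names):

{code}
import org.apache.spark.sql.functions.broadcast

// The broadcast() hint asks the planner for a broadcast hash join; the
// broadcasted side currently flows through the driver first.
val joined = largeDF.join(broadcast(smallDF), Seq("id"))
{code}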



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15694) Implement ScriptTransformation in sql/core

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15694:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Implement ScriptTransformation in sql/core
> --
>
> Key: SPARK-15694
> URL: https://issues.apache.org/jira/browse/SPARK-15694
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> ScriptTransformation currently relies on Hive internals. It'd be great if we 
> could implement a native ScriptTransformation in the sql/core module to remove 
> the extra Hive dependency here.
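
For readers unfamiliar with the feature, this is roughly what a script transformation looks like to the user today; it currently requires enableHiveSupport(), and the script and table names below are invented:

{code}
// Pipes each input row through an external script and parses its stdout
// back into columns.
spark.sql("""
  SELECT TRANSFORM (id, name)
  USING 'python normalize.py'
  AS (id, normalized_name)
  FROM people
""")
{code}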



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


