Spark-avro 4.0.0 is released

2017-11-10 Thread Gengliang Wang
The 4.0.0 release adds support for Spark 2.2. The published artifact is
compatible with both Spark 2.1 and 2.2.
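
For anyone upgrading, a minimal usage sketch (the package coordinate and paths below are illustrative):

// Launch with the package, e.g.:
//   spark-shell --packages com.databricks:spark-avro_2.11:4.0.0
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-avro-example").getOrCreate()

// Read and write Avro through the spark-avro data source.
val df = spark.read.format("com.databricks.spark.avro").load("/path/to/input.avro")
df.write.format("com.databricks.spark.avro").save("/path/to/output")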

New Features:

   - Support for Spark 2.2 (#242): resolves a compatibility issue with the
     datasource write API changes.

Bug fixes:

   - Fix name conflict in nested records (#249)


Release history:

   - https://github.com/databricks/spark-avro/releases


Thanks for the contributions from Imran Rashid, Gerard Solà and Jacky Shen!

-- 
Wang Gengliang
Software Engineer
Databricks Inc.


Re: spark-stream memory table global?

2017-11-10 Thread Shixiong(Ryan) Zhu
It must be accessed under the same SparkSession. We could also add an option
to make it a global temp view. Feel free to open a PR to improve it.
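
A minimal sketch of the current behavior (assuming a SparkSession named `spark` and a streaming DataFrame `streamingDF`; the query name is illustrative):

// The "memory" sink registers a local temp view that is only visible to the
// SparkSession that started the streaming query, not to other applications.
val query = streamingDF.writeStream
  .format("memory")
  .queryName("stream_results") // name of the in-memory table / temp view
  .outputMode("append")
  .start()

// Querying the table works only from that same SparkSession.
spark.sql("SELECT * FROM stream_results").show()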

On Fri, Nov 10, 2017 at 4:56 AM, Imran Rajjad  wrote:

> Hi,
>
> Is the memory table into which Spark Structured Streaming results are
> written available to other Spark applications on the cluster? Is it global
> by default, or will it only be available to the context where the streaming
> is being done?
>
> thanks
> Imran
>
> --
> I.R
>


Re: Generate windows on processing time in Spark Structured Streaming

2017-11-10 Thread Michael Armbrust
Hmmm, we should allow that. current_timestamp() is actually deterministic
within any given batch. Could you open a JIRA ticket?
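
In the meantime, a possible workaround sketch (my own untested assumption, not a confirmed fix): project current_timestamp() into a column first, then window over that column, so the nondeterministic expression stays inside a Project:

import org.apache.spark.sql.functions._

// Compute the processing-time column in a projection, then group by a window
// over it. Assumes the same `socketDF` as in the quoted code below.
val withTime = socketDF.withColumn("time", current_timestamp())

val windowedCounts = withTime
  .groupBy(window(col("time"), "5 seconds"), col("value"))
  .count()
  .withColumnRenamed("value", "word")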

On Fri, Nov 10, 2017 at 1:52 AM, wangsan  wrote:

> Hi all,
>
> How can I use current processing time to generate windows in streaming
> processing?
> The *window* function's Scala doc says "For a streaming query, you may use
> the function current_timestamp to generate windows on processing time."
> But when using current_timestamp as a column in the window function, an
> exception occurred.
>
> Here is my code:
>
> val socketDF = spark.readStream
>   .format("socket")
>   .option("host", "localhost")
>   .option("port", )
>   .load()
>
> socketDF.createOrReplaceTempView("words")
> val windowedCounts = spark.sql(
>   """
>     |SELECT value as word, current_timestamp() as time, count(1) as count FROM words
>     |   GROUP BY window(time, "5 seconds"), word
>   """.stripMargin)
>
> windowedCounts
>   .writeStream
>   .outputMode("complete")
>   .format("console")
>   .start()
>   .awaitTermination()
>
> And here is the exception info:
>
> Caused by: org.apache.spark.sql.AnalysisException: nondeterministic
> expressions are only allowed in
> Project, Filter, Aggregate or Window, found:


Parquet files from spark not readable in Cascading

2017-11-10 Thread Vikas Gandham
Hi,

When I try to read Parquet data that was generated by Spark in Cascading, it
throws the following error:



Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ""
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:103)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:47)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
    at cascading.util.Util.retry(Util.java:1044)
    at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:418)
    at java.util.ArrayList.get(ArrayList.java:431)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:83)
    at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:77)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:293)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)

This is mostly seen when the Parquet data has nested structures.

I didn't find any solution to this.

I see some JIRA issues like https://issues.apache.org/jira/browse/SPARK-10434
(Parquet compatibility / interoperability issues), where Parquet files
generated by Spark 1.5 could not be read in Spark 1.4. That was fixed in later
Spark versions, but was it fixed in Cascading?

I am not sure whether this has to do with the Parquet version, whether
Cascading has a bug, or whether Spark is doing something with the Parquet
files that Cascading does not accept.

Note: I am trying to read Parquet with an Avro schema in Cascading.
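
One thing I plan to try on the write side is Spark's spark.sql.parquet.writeLegacyFormat option, which writes nested types in the older parquet-mr compatible layout; whether that actually helps Cascading is my own assumption, not something I have verified:

// Sketch only: re-write the data with the legacy Parquet layout for nested types,
// in case the newer layout is what the Cascading/parquet-mr reader trips over.
// `df` stands for the DataFrame that was originally written; the path is illustrative.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

df.write
  .mode("overwrite")
  .parquet("/tmp/output-legacy-parquet")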

I have posted to the Cascading mailing list too.


Vikas Gandham
Software Engineer
MaxPoint®
Direct: 512.354.6185
www.maxpoint.com


Spark Streaming Kafka

2017-11-10 Thread Frank Staszak
Hi All,

I’m new to streaming Avro records and am parsing Avro from a Kafka direct
stream with Spark Streaming 2.1.1. Could anyone suggest an API for decoding
Avro records with Scala? I’ve found KafkaAvroDecoder, twitter/bijection, and
the Avro library; each seems to handle decoding. Has anyone found benefits of
using one over the others (for decoding)? It would seem preferable to just
retrieve the Avro schema from the schema registry and then translate the Avro
records to a case class. Is this the preferred way to decode Avro using
KafkaAvroDecoder?
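
For concreteness, here is the minimal decoding sketch with the plain Avro library that I have in mind (my own assumption: records are plain Avro binary with the writer schema known up front; records produced with Confluent's serializer carry a wire-format header and would need KafkaAvroDecoder instead). The Event schema below is just for illustration.

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Hypothetical schema, for illustration only.
val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"}]}""")

val reader = new GenericDatumReader[GenericRecord](schema)

// Decode one Kafka message value (a byte array) into a GenericRecord.
def decode(bytes: Array[Byte]): GenericRecord = {
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  reader.read(null, decoder)
}

// A case class the record could be mapped onto afterwards.
case class Event(id: String)
def toEvent(r: GenericRecord): Event = Event(r.get("id").toString)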

Thank you in advance,
-Frank



spark-stream memory table global?

2017-11-10 Thread Imran Rajjad
Hi,

Is the memory table into which Spark Structured Streaming results are
written available to other Spark applications on the cluster? Is it global
by default, or will it only be available to the context where the streaming
is being done?

thanks
Imran

-- 
I.R


Generate windows on processing time in Spark Structured Streaming

2017-11-10 Thread wangsan
Hi all,


How can I use current processing time to generate windows in streaming
processing?
The window function's Scala doc says "For a streaming query, you may use the
function current_timestamp to generate windows on processing time." But when
using current_timestamp as a column in the window function, an exception
occurred.


Here is my code:
val socketDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", )
  .load()

socketDF.createOrReplaceTempView("words")

val windowedCounts = spark.sql(
  """
    |SELECT value as word, current_timestamp() as time, count(1) as count FROM words
    |   GROUP BY window(time, "5 seconds"), word
  """.stripMargin)

windowedCounts
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
And here is the exception info:

Caused by: org.apache.spark.sql.AnalysisException: nondeterministic expressions
are only allowed in Project, Filter, Aggregate or Window, found: