Multiple queries on same stream

2017-08-08 Thread Raghavendra Pandey
I am using structured streaming to evaluate multiple rules on the same running stream. I have two options to do that. One is to use foreach and evaluate all the rules on each row. The other option is to express the rules in the Spark SQL DSL and run multiple queries. I was wondering if option 1 will result in
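
A minimal sketch of the two options, assuming a Kafka-backed streaming DataFrame; all names (the topic, the rule predicate) are illustrative:

    import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

    val spark = SparkSession.builder.appName("rules").getOrCreate()
    import spark.implicits._

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "events")
      .load()

    // Option 1: a single query, every rule evaluated per row in a foreach sink
    events.writeStream.foreach(new ForeachWriter[Row] {
      def open(partitionId: Long, version: Long): Boolean = true
      def process(row: Row): Unit = { /* evaluate all rules against the row */ }
      def close(errorOrNull: Throwable): Unit = ()
    }).start()

    // Option 2: one query per rule, each rule expressed in the DataFrame DSL
    val rule1 = events.filter($"value".isNotNull)   // illustrative rule
    rule1.writeStream.format("console").start()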

Reusing dataframes for streaming (spark 1.6)

2017-08-08 Thread Ashwin Raju
Hi, We've built a batch application on Spark 1.6.1. I'm looking into how to run the same code as a streaming (DStream based) application. This is using pyspark. In the batch application, we have a sequence of transforms that read from file, do dataframe operations, then write to file. I was
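
One common shape for this (a sketch in Scala, though the same pattern applies in pyspark; paths and the transform body are illustrative, and sc is the existing SparkContext) is to factor the batch logic into a function over DataFrames and apply it per micro-batch with foreachRDD:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The existing batch transforms, factored into a function both paths share
    def transform(df: DataFrame): DataFrame = df.filter("value IS NOT NULL")

    val ssc = new StreamingContext(sc, Seconds(30))
    val lines = ssc.textFileStream("hdfs:///incoming")

    lines.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      val df = rdd.toDF("value")              // one DataFrame per micro-batch
      transform(df).write.mode("append").parquet("hdfs:///out")
    }
    ssc.start()
    ssc.awaitTermination()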

Re: Question about 'Structured Streaming'

2017-08-08 Thread Michael Armbrust
> 1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header that contains the 'schema' for the data; each log (http/dns/etc.) will have different columns with different data types. So would I create a specific CSV reader inherited from the general one? Also I'm assuming this
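
Rather than inheriting from the CSV reader, one approach is to build the schema from the header yourself and hand it to the stock reader; a sketch, with illustrative Bro field names and path:

    import org.apache.spark.sql.types._

    // Schema derived from the Bro header lines (fields are illustrative)
    val dnsSchema = StructType(Seq(
      StructField("ts", DoubleType),
      StructField("uid", StringType),
      StructField("query", StringType)))

    val dns = spark.read
      .schema(dnsSchema)
      .option("sep", "\t")
      .option("comment", "#")   // Bro header lines start with '#'
      .csv("/bro/logs/dns.log")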

StructuredStreaming: java.util.concurrent.TimeoutException: Cannot fetch record for offset

2017-08-08 Thread aravias
Hi, we have a structured streaming app consuming data from Kafka and writing to S3. I keep getting this timeout exception whenever the executor is specified and running with more than one core per executor. If anyone can share any info related to this, it would be great. 17/08/08

How to get the lag of a column in a spark streaming dataframe?

2017-08-08 Thread Prashanth Kumar Murali
I have data streaming into my Spark Scala application in this format:

    id    mark1 mark2 mark3 time
    uuid1 100   200   300   Tue Aug 8 14:06:02 PDT 2017
    uuid1 100   200   300   Tue Aug 8 14:06:22 PDT 2017
    uuid2 150   250   350   Tue Aug 8 14:06:32 PDT 2017
    uuid2 150   250   350   Tue Aug 8
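
Window functions such as lag were not supported on streaming DataFrames at the time, but on a static DataFrame with these columns the lag would be (a sketch, assuming the data is in a DataFrame df):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    // Previous mark1 per id, ordered by event time
    val w = Window.partitionBy("id").orderBy("time")
    val withLag = df.withColumn("prev_mark1", lag("mark1", 1).over(w))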

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2017-08-08 Thread dcam
Considering the @transient annotations and the work done in the instance initializer, not much state is really being broadcast to the executors. It might be simpler to just create these instances on the executors, rather than trying to broadcast them?
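
A sketch of that pattern, with a hypothetical HeavyParser standing in for the non-serializable state and rdd for the data:

    // Stand-in for a heavyweight, non-serializable dependency (hypothetical)
    class HeavyParser { def parse(s: String): String = s.trim }

    // One instance per executor JVM, created lazily on first use; nothing
    // travels through closure serialization or a broadcast
    object Parsers { lazy val parser = new HeavyParser() }

    val parsed = rdd.map(line => Parsers.parser.parse(line))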

Re: Question about 'Structured Streaming'

2017-08-08 Thread Brian Wylie
I can see your point that you don't really want an external process being used for the streaming data source. Okay, so on the CSV/TSV front, I have two follow-up questions: 1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header that contains the 'schema' for the data, each log

[Spark Structured Streaming]: truncated Parquet after driver crash or kill

2017-08-08 Thread dcam
Hello list. We have a Spark application that performs a set of ETLs: reading messages from a Kafka topic, categorizing them, and writing the contents out as Parquet files on HDFS. After writing, we query the data from HDFS using Presto's Hive integration. We are having problems because the
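
For reference, a sketch of that pipeline shape (topic, paths, and the cast are illustrative). The file sink records committed files in the _spark_metadata log under the output path; Spark readers honor that log, but Presto lists the directory directly and can see files from batches that never committed:

    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "messages")
      .load()

    messages.selectExpr("CAST(value AS STRING) AS body")
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///warehouse/messages")
      .option("checkpointLocation", "hdfs:///checkpoints/messages")
      .start()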

Re: Question about 'Structured Streaming'

2017-08-08 Thread Michael Armbrust
Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to read Bro logs, rather than a Python library. This is likely to have much better performance, since we can do all of the parsing on the JVM without having to flow it through an external Python process. On Tue, Aug 8, 2017 at
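
A sketch of that approach, assuming Bro is configured to emit JSON logs; the path and field name are illustrative:

    // The built-in JSON reader parses entirely on the JVM and infers the schema
    val conn = spark.read.json("/bro/logs/conn.log")
    conn.printSchema()
    conn.groupBy("proto").count().show()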

Re: Spark Streaming job statistics

2017-08-08 Thread KhajaAsmath Mohammed
No, I will take a look now. On Tue, Aug 8, 2017 at 1:47 PM, Riccardo Ferrari wrote: > Hi, > > Have you tried to check the "Streaming" tab menu? > > Best, > > On Tue, Aug 8, 2017 at 4:15 PM, KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: >> Hi, >> >> I am running

Re: Spark Streaming job statistics

2017-08-08 Thread Riccardo Ferrari
Hi, Have you tried to check the "Streaming" tab menu? Best, On Tue, Aug 8, 2017 at 4:15 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am running a spark streaming job which receives data from azure iot hub. I > am not sure if the connection was successful and receiving

Re: count exceed int.MaxValue

2017-08-08 Thread Vadim Semenov
Scala doesn't support ranges with more than Int.MaxValue elements: https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/immutable/Range.scala?utf8=✓#L89 You can create two RDDs and union them: scala> val rdd = sc.parallelize(1L to Int.MaxValue.toLong).union(sc.parallelize(1L to
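
A complete version of that workaround (a sketch):

    // Two RDDs of Int.MaxValue elements each, unioned into one
    val rdd = sc.parallelize(1L to Int.MaxValue.toLong)
      .union(sc.parallelize(1L to Int.MaxValue.toLong))
    rdd.count()   // 4294967294, i.e. 2 * Int.MaxValue.toLong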

count exceed int.MaxValue

2017-08-08 Thread makoto
Hello, I'd like to count more than Int.MaxValue elements. But I encountered the following error. scala> val rdd = sc.parallelize(1L to Int.MaxValue*2.toLong) rdd: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[28] at parallelize at <console>:24 scala> rdd.count java.lang.IllegalArgumentException: More

Fwd: Python question about 'Structured Streaming'

2017-08-08 Thread Brian Wylie
Hi All, I've read the new information about Structured Streaming in Spark, looks super great. Resources that I've looked at - https://spark.apache.org/docs/latest/streaming-programming-guide.html - https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html -

speculative execution in spark

2017-08-08 Thread john_test_test
Is it possible by any means to take advantage of the already processed portion of the failed task, so I can use speculative execution to reassign only what is left of the original task to another node? If yes, how can I read it from memory?
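
Speculative execution re-runs a full copy of a slow task on another node; the partial output of the original attempt is not reused. A sketch of the relevant settings (values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")
      .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish first
      .set("spark.speculation.multiplier", "1.5")   // how much slower than the median counts as slow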

Question about 'Structured Streaming'

2017-08-08 Thread Brian Wylie
Hi All, I've read the new information about Structured Streaming in Spark, looks super great. Resources that I've looked at - https://spark.apache.org/docs/latest/streaming-programming-guide.html - https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html -

IndexOutOfBoundException in catalyst when doing multiple approxDistinctCount

2017-08-08 Thread AssafMendelson
Hi, I am doing a large number of aggregations on a dataframe (without groupBy) to get some statistics. As part of this I am doing an approx_count_distinct(c, 0.01) Everything works fine but when I do the same aggregation a second time (for each column) I get the following error: [Stage 2:>
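
For reference, the shape of such an aggregation (a sketch with illustrative column names a and b on a DataFrame df):

    import org.apache.spark.sql.functions.approx_count_distinct

    // Several approximate distinct counts in a single pass, rsd = 0.01
    val stats = df.agg(
      approx_count_distinct("a", 0.01).as("a_distinct"),
      approx_count_distinct("b", 0.01).as("b_distinct"))
    stats.show()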

Spark Streaming job statistics

2017-08-08 Thread KhajaAsmath Mohammed
Hi, I am running a Spark streaming job which receives data from Azure IoT Hub. I am not sure if the connection was successful and is receiving any data. Does the input column show how much data it has read if the connection was successful? [image: Inline image 1]

Re: [SS] Console sink not supporting recovering from checkpoint location? Why?

2017-08-08 Thread Jacek Laskowski
Hi Michael, That reflects my sentiments so well. Thanks for confirming my thoughts! https://issues.apache.org/jira/browse/SPARK-21667 Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at

Re: Matrix multiplication and cluster / partition / blocks configuration

2017-08-08 Thread Phillip Henry
Hi, John. I've had similar problems. IIRC, the driver was GCing madly. I don't know why the driver was doing so much work but I quickly implemented an alternative approach. The code I wrote belongs to my client but I wrote something that should be equivalent. It can be found at:
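
For context, a minimal sketch of a distributed multiply via MLlib's BlockMatrix, where the block sizes picked at conversion time drive the partitioning (sc is the SparkContext; the entries and sizes are illustrative):

    import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

    val entries = sc.parallelize(Seq(
      MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))
    // Block sizes chosen here determine partitioning and shuffle volume in multiply()
    val a: BlockMatrix = new CoordinateMatrix(entries).toBlockMatrix(1024, 1024).cache()
    val product = a.multiply(a)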

Re: tuning - Spark data serialization for cache() ?

2017-08-08 Thread Kazuaki Ishizaki
For DataFrame (and Dataset) cache(), neither Java nor Kryo serialization is used. There is no way to use Java or Kryo serialization for DataFrame.cache() or Dataset.cache() in memory. Are you talking about serialization to disk? In my previous mail, I talked only about the in-memory case. Regards,
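
To illustrate the distinction (a sketch, assuming an existing rdd and df): serialized RDD storage levels go through the configured serializer, while DataFrame.cache() always uses the internal columnar format:

    import org.apache.spark.storage.StorageLevel

    // RDD storage: serialized levels use the configured serializer
    // (Java by default, Kryo if spark.serializer is set to KryoSerializer)
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)

    // DataFrame/Dataset cache: always the internal columnar format,
    // regardless of spark.serializer
    df.cache()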