Re: Query around Spark Checkpoints

2020-09-29 Thread Bryan Jeffrey
Jungtaek, How would you contrast stateful streaming with checkpoint vs. the idea of writing updates to a Delta Lake table, and then using the Delta Lake table as a streaming source for our state stream? Thank you, Bryan On Mon, Sep 28, 2020 at 9:50 AM Debabrata Ghosh wrote: > Thank You

Re: Metrics Problem

2020-07-10 Thread Bryan Jeffrey
On Thu, Jul 2, 2020 at 2:33 PM Bryan Jeffrey wrote: > Srinivas, > > I finally freed up a little bit of time to look at this issue. I > reduced the scope of my ambitions and simply cloned the ConsoleSink and > ConsoleReporter class. After doing so I can see the original

Re: Metrics Problem

2020-06-28 Thread Bryan Jeffrey
Srinivas, Interestingly, I did have the metrics jar packaged as part of my main jar. It worked well both on the driver and locally, but not on executors. Regards, Bryan Jeffrey From: Srinivas V Sent: Saturday

Re: Metrics Problem

2020-06-26 Thread Bryan Jeffrey
providers which appear to work. Regards, Bryan From: Srinivas V Sent: Friday, June 26, 2020 9:47:52 PM To: Bryan Jeffrey Cc: user Subject: Re: Metrics Problem It should work when you are giving hdfs path a

Re: Metrics Problem

2020-06-26 Thread Bryan Jeffrey
It may be helpful to note that I'm running in Yarn cluster mode. My goal is to avoid having to manually distribute the JAR to all of the various nodes as this makes versioning deployments difficult. On Thu, Jun 25, 2020 at 5:32 PM Bryan Jeffrey wrote: > Hello. > > I am running Spark

Metrics Problem

2020-06-25 Thread Bryan Jeffrey
Hello. I am running Spark 2.4.4. I have implemented a custom metrics producer. It works well when I run locally, or specify the metrics producer only for the driver. When I ask for executor metrics I run into ClassNotFoundExceptions. *Is it possible to pass a metrics JAR via --jars? If so what

Custom Metrics

2020-06-18 Thread Bryan Jeffrey
. Is there a suggested mechanism? Thank you, Bryan Jeffrey

Data Source - State (SPARK-28190)

2020-03-30 Thread Bryan Jeffrey
to be potentially addressed by your ticket SPARK-28190 - "Data Source - State". I see little activity on this ticket. Can you help me to understand where this feature currently stands? Thank you, Bryan Jeffrey

Re: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-29 Thread Bryan Jeffrey
like to come up to speed on the right way to validate changes and perhaps contribute myself. Regards, Bryan Jeffrey On Sat, Feb 29, 2020 at 9:52 AM Jungtaek Lim wrote: > Forgot to mention - it only occurs when the SQL type of the UDT has a fixed > length. If the UDT is used to represent c

Fwd: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Bryan Jeffrey
github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/types/SQLUserDefinedType.java>` > on the class definition. You don't seem to have done it, maybe that's the > reason? > > I would debug by printing the values in the serialize/deserialize methods,

Re: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Bryan Jeffrey
Perfect. I'll give this a shot and report back. From: Tathagata Das Sent: Friday, February 28, 2020 6:23:07 PM To: Bryan Jeffrey Cc: user Subject: Re: Structured Streaming: mapGroupsWithState UDT seriali

Re: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Bryan Jeffrey
serialization bug also causes issues outside of stateful streaming, such as when executing a simple group by. Regards, Bryan Jeffrey From: Tathagata Das Sent: Friday, February 28, 2020 4:56:07 PM To: Bryan Jeffr

Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Bryan Jeffrey
FooWithDate = FooWithDate(b.date, a.s + b.s, a.i + b.i) } The test output shows the invalid date: org.scalatest.exceptions.TestFailedException: Array(FooWithDate(2021-02-02T19:26:23.374Z,Foo,1), FooWithDate(2021-02-02T19:26:23.374Z,FooFoo,6)) did not contain the same elements as Array(FooWithDate(2020-01-02T03:04:05.006Z,Foo,1), FooWithDate(2020-01-02T03:04:05.006Z,FooFoo,6)) Is this something folks have encountered before? Thank you, Bryan Jeffrey

Accumulator v2

2020-01-21 Thread Bryan Jeffrey
, Bryan Jeffrey

Structured Streaming & Enrichment Broadcasts

2019-11-18 Thread Bryan Jeffrey
read an external data source and do a fast lookup for a streaming input. One option appears to be to do a broadcast left outer join? In the past this mechanism has been less easy to performance tune than doing an explicit broadcast and lookup. Regards, Bryan Jeffrey
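A minimal sketch of the broadcast left outer join option mentioned above, assuming a streaming DataFrame streamDf and a small static reference DataFrame refDf (assumed names):

    import org.apache.spark.sql.functions.broadcast

    // hint Spark to broadcast the small reference side; "id" is an assumed join key
    val enriched = streamDf.join(broadcast(refDf), Seq("id"), "left_outer")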

Driver - Stops Scheduling Streaming Jobs

2019-08-27 Thread Bryan Jeffrey
anything pop out. Has anyone else seen this behavior? Any thoughts on debugging? Regards, Bryan Jeffrey

Driver OOM does not shut down Spark Context

2019-01-31 Thread Bryan Jeffrey
the Streaming application never terminated. No new batches were started. As a result my job did not process data for some period of time (until our ancillary monitoring noticed the issue). *Ask: What can we do to ensure that the driver is shut down when this type of exception occurs?* Regards, Bryan

Re: ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-31 Thread Bryan Jeffrey
Cody, Yes - I was able to verify that I am not seeing duplicate calls to createDirectStream. If the spark-streaming-kafka-0-10 will work on a 2.3 cluster I can go ahead and give that a shot. Regards, Bryan Jeffrey On Fri, Aug 31, 2018 at 11:56 AM Cody Koeninger wrote: > Just to be 100% s

Re: ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-31 Thread Bryan Jeffrey
. Regards, Bryan Jeffrey On Thu, Aug 30, 2018 at 2:56 PM Cody Koeninger wrote: > I doubt that fix will get backported to 2.3.x > > Are you able to test against master? 2.4 with the fix you linked to > is likely to hit code freeze soon. > > From a quick look at your code

ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-30 Thread Bryan Jeffrey
ructType(Array(StructField("value", BinaryType
df.selectExpr("CAST('' AS STRING)", "value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", getBrokersToUse(brokers))
  .option("compression.type", "gzip")
  .option("retries", "3")
  .option("topic", topic)
  .save()
}
Regards, Bryan Jeffrey
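A cleaned-up, self-contained sketch of the writer fragment above, assuming Spark 2.x with the spark-sql-kafka-0-10 package on the classpath and the broker list already resolved to a string; producer settings are passed with the "kafka." prefix the sink expects:

    import org.apache.spark.sql.DataFrame

    def writeToKafka(df: DataFrame, brokers: String, topic: String): Unit = {
      // the Kafka sink reads "key"/"value" columns; the empty key mirrors the original snippet
      df.selectExpr("CAST('' AS STRING) AS key", "value")
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("kafka.compression.type", "gzip")
        .option("kafka.retries", "3")
        .option("topic", topic)
        .save()
    }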

Heap Memory in Spark 2.3.0

2018-07-16 Thread Bryan Jeffrey
ocation in Spark 2.3.0 that would cause these issues? Thank you for any help you can provide. Regards, Bryan Jeffrey

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
:01:01 PM To: Bryan Jeffrey Cc: user Subject: Re: Kafka Offset Storage: Fetching Offsets Offsets are loaded when you instantiate an org.apache.kafka.clients.consumer.KafkaConsumer, subscribe, and poll. There's not an explicit api for it. Have you looked at the output of kafka-consumer-gro

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
Cody, Where is that called in the driver? The only call I see from Subscribe is to load the offset from checkpoint. From: Cody Koeninger Sent: Thursday, June 14, 2018 4:24:58 PM To: Bryan Jeffrey Cc: user S

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
:31 PM To: Bryan Jeffrey Cc: user Subject: Re: Kafka Offset Storage: Fetching Offsets The expectation is that you shouldn't have to manually load offsets from kafka, because the underlying kafka consumer on the driver will start at the offsets associated with the given group id. That's the behavior

Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
reaming Kafka library. How is this expected to work? Is there an example of saving the offsets to Kafka and then loading them from Kafka? Regards, Bryan Jeffrey

Re: [Spark 2.x Core] Adding to ArrayList inside rdd.foreach()

2018-04-07 Thread Bryan Jeffrey
You can just call rdd.flatMap(_._2).collect From: klrmowse Sent: Saturday, April 7, 2018 1:29:34 PM To: user@spark.apache.org Subject: Re: [Spark 2.x Core] Adding to ArrayList inside
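For illustration, a minimal sketch of that call on an assumed RDD of (key, Seq[Int]) pairs; it avoids mutating a driver-side list inside foreach:

    val rdd = sc.parallelize(Seq(("a", Seq(1, 2)), ("b", Seq(3))))
    val values: Array[Int] = rdd.flatMap(_._2).collect()
    // values: Array(1, 2, 3)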

Re: Return statements aren't allowed in Spark closures

2018-02-21 Thread Bryan Jeffrey
Lian, You're writing Scala. Just remove the 'return'. No need for it in Scala. From: Lian Jiang Sent: Wednesday, February 21, 2018 4:16:08 PM To: user Subject: Return statements aren't

Stopping a Spark Streaming Context gracefully

2017-11-07 Thread Bryan Jeffrey
. I looked in Jira and did not see an open issue. Is this a known problem? If not I'll open a bug. Regards, Bryan Jeffrey

Re: about broadcast join of base table in spark sql

2017-06-30 Thread Bryan Jeffrey
Hello. If you want to allow a broadcast join with larger broadcasts you can set spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the plan to allow the join despite 'A' being larger than the default threshold. From: paleyl Sent:
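A minimal sketch of the two common ways to get this behavior, assuming Spark 2.x with a SparkSession named spark and DataFrames a and b (assumed names):

    // raise the size threshold (in bytes) under which Spark will broadcast a join side
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

    // or force a broadcast of one side explicitly, regardless of the threshold; "id" is an assumed join key
    import org.apache.spark.sql.functions.broadcast
    val joined = b.join(broadcast(a), Seq("id"))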

Re: Question about Parallel Stages in Spark

2017-06-27 Thread Bryan Jeffrey
A and B. Jeffrey's code did not cause two submits. ---Original--- From: "Pralabh Kumar"<pralabhku...@gmail.com> Date: 2017/6/27 12:09:27 To: "萝卜丝炒饭"<1427357...@qq.com>; Cc: "user"<user@spark.apache.org>; "satishl"<satish.la...@gmail.com>;

Re: Question about Parallel Stages in Spark

2017-06-26 Thread Bryan Jeffrey
Hello. The driver is running the individual operations in series, but each operation is parallelized internally. If you want them to run in parallel you need to provide the driver a mechanism to thread the job scheduling out:
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(1 to
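A minimal sketch of threading the job submission from the driver, assuming the operations are independent; the map/count bodies are placeholders:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val rdd1 = sc.parallelize(1 to 10)
    val rdd2 = sc.parallelize(1 to 10)

    // each Future submits an independent job, so Spark can schedule them concurrently
    val f1 = Future { rdd1.map(_ * 2).count() }
    val f2 = Future { rdd2.map(_ + 1).count() }
    val results = Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)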

Re: Broadcasts & Storage Memory

2017-06-21 Thread Bryan Jeffrey
of that storage memory fraction. Bryan Jeffrey On Wed, Jun 21, 2017 at 6:48 PM -0400, "satish lalam" <satish.la...@gmail.com> wrote: My understanding is - it is from storageFraction. Here cached blocks are immune to eviction - so bot

Broadcasts & Storage Memory

2017-06-21 Thread Bryan Jeffrey
the storage memory for cached RDDs. You end up with executor memory that looks like the following:
All memory: 0-100
Spark memory: 0-75
RDD Storage: 0-37
Other Spark: 38-75
Other Reserved: 76-100
Where do broadcast variables fall into the mix? Regards, Bryan Jeffrey

Re: how many topics spark streaming can handle

2017-06-19 Thread Bryan Jeffrey
Hello Ashok, We're consuming from more than 10 topics in some Spark streaming applications. Topic management is a concern (what is read from where, etc), but I have seen no issues from Spark itself. Regards, Bryan Jeffrey On Mon, Jun 19, 2017

Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Bryan Jeffrey
provides a lot of similar features, but the amount of typing required to set down a small function is excessive at best! Regards, Bryan Jeffrey On Wed, Jun 7, 2017 at 12:51 PM, Jörn Franke <jornfra...@gmail.com> wrote: > I think this is a religious question ;-) > Java is often un

Re: Is there a way to do conditional group by in spark 2.1.1?

2017-06-03 Thread Bryan Jeffrey
You should be able to project a new column that is your group column. Then you can group on the projected column. On Sat, Jun 3, 2017 at 6:26 PM -0400, "upendra 1991" wrote: Use a function
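A minimal sketch of the project-then-group approach, assuming a DataFrame df with a numeric "amount" column (assumed names):

    import org.apache.spark.sql.functions.{col, when}

    // derive the conditional grouping column, then group on it
    val withGroup = df.withColumn("grp", when(col("amount") >= 100, "large").otherwise("small"))
    val grouped = withGroup.groupBy("grp").count()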

Re: Convert camelCase to snake_case when saving Dataframe/Dataset to parquet?

2017-05-22 Thread Bryan Jeffrey
Mike, I have code to do that. I'll share it tomorrow. On Mon, May 22, 2017 at 4:53 PM -0400, "Mike Wheeler" wrote: Hi Spark User, For Scala case class, we usually use camelCase (carType) for member fields. However,
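The code referred to above is not included in this excerpt; a hedged sketch of one way to rename camelCase columns to snake_case before writing, assuming a DataFrame df:

    // e.g. carType -> car_type
    def toSnakeCase(name: String): String =
      name.replaceAll("([A-Z])", "_$1").toLowerCase

    val snakeCased = df.columns.foldLeft(df)((d, c) => d.withColumnRenamed(c, toSnakeCase(c)))
    snakeCased.write.parquet("/tmp/out")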

Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-04 Thread bryan . jeffrey
RDD operation: rdd.map(x => (word, count)).reduceByKey(_+_) On Sat, Mar 4, 2017 at 8:59 AM -0500, "Old-School" wrote: Hi, I want to perform some simple transformations and check the execution time, under
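For comparison, a hedged sketch of the DataFrame counterpart of that word-count style aggregation, assuming a DataFrame with a "word" column:

    val counts = df.groupBy("word").count()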

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread bryan . jeffrey
Mohammad, We store our offsets in Cassandra, and use that for tracking. This solved a few issues for us, as it provides a good persistence mechanism even when you're reading from multiple clusters. Bryan Jeffrey On Tue, Feb 14, 2017 at 7:03 PM

Re: How to specify "verbose GC" in Spark submit?

2017-02-06 Thread Bryan Jeffrey
ify --conf "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -verbose:gc", etc. Bryan Jeffrey On Mon, Feb 6, 2017 at 8:02 AM, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Dear All, > > Is there any way to specify verbose GC -i.e. “-verbose:gc > -XX:

Re: Streaming Batch Oddities

2016-12-13 Thread Bryan Jeffrey
All, Any thoughts? I can run another couple of experiments to try to narrow the problem. The total data volume in the repartition is around 60GB / batch. Regards, Bryan Jeffrey On Tue, Dec 13, 2016 at 12:11 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Hello. > > I

Streaming Batch Oddities

2016-12-13 Thread Bryan Jeffrey
which calls coalesce (shuffle=true), which creates a new ShuffledRDD with a HashPartitioner. These calls appear functionally equivalent - I am having trouble coming up with a justification for the significant performance differences between calls. Help? Regards, Bryan Jeffrey

Event Log Compression

2016-07-26 Thread Bryan Jeffrey
? Thank you, Bryan Jeffrey

Spark 2.0

2016-07-25 Thread Bryan Jeffrey
be willing to go fix it myself). Should I just create a ticket? Thank you, Bryan Jeffrey

Re: Kafka Exceptions

2016-06-13 Thread Bryan Jeffrey
Cody, We already set the maxRetries. We're still seeing the issue - when the leader is shifted, for example, it does not appear that the direct stream reader correctly handles this. We're running 1.6.1. Bryan Jeffrey On Mon, Jun 13, 2016 at 10:37 AM, Cody Koeninger <c...@koeninger.org> wrote:

Kafka Exceptions

2016-06-13 Thread Bryan Jeffrey
you, Bryan Jeffrey

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
All, Thank you for the replies. It seems as though the Dataset API is still far behind the RDD API. This is unfortunate as the Dataset API potentially provides a number of performance benefits. I will move to using it in a more limited set of cases for the moment. Thank you! Bryan Jeffrey

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
It would also be nice if there was a better example of joining two Datasets. I am looking at the documentation here: http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems a little bit sparse - is there a better documentation source? Regards, Bryan Jeffrey On Tue, Jun 7, 2016

Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
Hello. I am looking at the option of moving RDD based operations to Dataset based operations. We are calling 'reduceByKey' on some pair RDDs we have. What would the equivalent be in the Dataset interface - I do not see a simple reduceByKey replacement. Regards, Bryan Jeffrey
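A minimal sketch of the closest Dataset equivalent, assuming Spark 2.x with spark.implicits._ in scope and a Dataset of (String, Int) pairs:

    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

    // groupByKey + reduceGroups plays the role of reduceByKey
    val reduced = ds
      .groupByKey { case (k, _) => k }
      .reduceGroups((a, b) => (a._1, a._2 + b._2))
      .map { case (_, kv) => kv }  // drop the duplicated key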

Streaming application slows over time

2016-05-09 Thread Bryan Jeffrey
for or known bugs in similar instances? Regards, Bryan Jeffrey

Issues with Long Running Streaming Application

2016-04-25 Thread Bryan Jeffrey
, Bryan Jeffrey

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Bryan Jeffrey
ns = kafkaWritePartitions) detectionWriter.write(dataToWriteToKafka) Hope that helps! Bryan Jeffrey On Thu, Apr 21, 2016 at 2:08 PM, Alexander Gallego <agall...@concord.io> wrote: > Thanks Ted. > > KafkaWordCount (producer) does not operate on a DStream[T

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Bryan Jeffrey
Cody et al., I am seeing a similar error. I've increased the number of retries. Once I've got a job up and running I'm seeing it retry correctly. However, I am having trouble getting the job started - the number of retries does not seem to help with startup behavior. Thoughts? Regards, Bryan

Re: OOM Exception in my spark streaming application

2016-03-14 Thread Bryan Jeffrey
Steve & Adam, I would be interested in hearing the outcome here as well. I am seeing some similar issues in my 1.4.1 pipeline, using stateful functions (reduceByKeyAndWindow and updateStateByKey). Regards, Bryan Jeffrey On Mon, Mar 14, 2016 at 6:45 AM, Steve Loughran <ste...@hortonwo

Re: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Bryan Jeffrey
Prateek, I believe that one task is created per Cassandra partition. How is your data partitioned? Regards, Bryan Jeffrey On Thu, Mar 10, 2016 at 10:36 AM, Prateek . <prat...@aricent.com> wrote: > Hi, > > > > I have a Spark Batch job for reading timeseries data from

Suggested Method to Write to Kafka

2016-03-01 Thread Bryan Jeffrey
Hello. Is there a suggested method and/or some example code to write results from a Spark streaming job back to Kafka? I'm using Scala and Spark 1.4.1. Regards, Bryan Jeffrey

Re: Spark with .NET

2016-02-09 Thread Bryan Jeffrey
Arko, Check this out: https://github.com/Microsoft/SparkCLR This is a Microsoft authored C# language binding for Spark. Regards, Bryan Jeffrey On Tue, Feb 9, 2016 at 3:13 PM, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Doesn't seem to be supported, but

Re: Access batch statistics in Spark Streaming

2016-02-08 Thread Bryan Jeffrey
From within a Spark job you can use a Periodic Listener:
ssc.addStreamingListener(PeriodicStatisticsListener(Seconds(60)))
class PeriodicStatisticsListener(timePeriod: Duration) extends StreamingListener {
  private val logger = LoggerFactory.getLogger("Application")
  override def
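The class above is truncated; a minimal self-contained sketch of a StreamingListener that logs batch statistics (the periodic roll-up from the original is omitted):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
    import org.slf4j.LoggerFactory

    class BatchStatisticsListener extends StreamingListener {
      private val logger = LoggerFactory.getLogger("Application")

      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        logger.info(s"Batch ${info.batchTime}: ${info.numRecords} records, " +
          s"scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms, " +
          s"processing time ${info.processingDelay.getOrElse(-1L)} ms")
      }
    }

    // registered the same way as above: ssc.addStreamingListener(new BatchStatisticsListener)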

Error w/ Invertable ReduceByKeyAndWindow

2016-02-01 Thread Bryan Jeffrey
I am sure we're doing consistent hashing. The 'reduceAdd' function is adding to a map. The 'inverseReduceFunction' is subtracting from the map. The filter function is removing items where the number of entries in the map is zero. Has anyone seen this error before? Regards, Bryan Jeffrey

Re: Error w/ Invertable ReduceByKeyAndWindow

2016-02-01 Thread Bryan Jeffrey
Excuse me - I should have mentioned: I am running Spark 1.4.1, Scala 2.11. I am running in streaming mode receiving data from Kafka. Regards, Bryan Jeffrey On Mon, Feb 1, 2016 at 9:19 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Hello. > > I have a reduceByKeyAnd

Re: Hive error after update from 1.4.1 to 1.5.2

2015-12-16 Thread Bryan Jeffrey
let me know what was the > resolution? > > Thanks, > Ashwin > > On Fri, Nov 20, 2015 at 12:07 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> > wrote: > >> Nevermind. I had a library dependency that still had the old Spark >> version. >> >> On Fr

DateTime Support - Hive Parquet

2015-11-23 Thread Bryan Jeffrey
with 1.5.2 - however, I am still seeing the associated errors. Is there a bug I can follow to determine when DateTime will be supported for Parquet? Regards, Bryan Jeffrey

Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Hello. I'm seeing an error creating a Hive Context moving from Spark 1.4.1 to 1.5.2. Has anyone seen this issue? I'm invoking the following: new HiveContext(sc) // sc is a Spark Context I am seeing the following error: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding

Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
The 1.5.2 Spark was compiled using the following options: mvn -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive -Phive-thriftserver clean package Regards, Bryan Jeffrey On Fri, Nov 20, 2015 at 2:13 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Hello. > > I'm

Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Nevermind. I had a library dependency that still had the old Spark version. On Fri, Nov 20, 2015 at 2:14 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > The 1.5.2 Spark was compiled using the following options: mvn > -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Ph

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
master spark://10.0.0.4:7077 --packages com.datastax.spark:spark-cassandra-connector_2.11:1.5.0-M1 --hiveconf "spark.cores.max=2" --hiveconf "spark.executor.memory=2g" Do I perhaps need to include an additional library to do the default conversion? Regards, Bryan Jeffrey On Th

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Yes, I do - I found your example of doing that later in your slides. Thank you for your help! On Thu, Nov 12, 2015 at 12:20 PM, Mohammed Guller <moham...@glassbeam.com> wrote: > Did you mean Hive or Spark SQL JDBC/ODBC server? > > > > Mohammed > > > > *From:*

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
TIONS ( keyspace "c2", table "detectionresult" ); ]Error: java.io.IOException: Failed to open native connection to Cassandra at {10.0.0.4}:9042 (state=,code=0) This seems to be connecting to local host regardless of the value I set spark.cassandra.connection.host to. Regards,

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Mohammed, That is great. It looks like a perfect scenario. Would I be able to make the created DF queryable over the Hive JDBC/ODBC server? Regards, Bryan Jeffrey On Wed, Nov 11, 2015 at 9:34 PM, Mohammed Guller <moham...@glassbeam.com> wrote: > Short answer: yes. > > > >

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Answer: In beeline run the following: SET spark.cassandra.connection.host="10.0.0.10" On Thu, Nov 12, 2015 at 1:13 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Mohammed, > > While you're willing to answer questions, is there a trick to getting the > H

Spark Dynamic Partitioning Bug

2015-11-05 Thread Bryan Jeffrey
the manually calculated fields are correct. However, the dynamically calculated (string) partition for idAndSource is a random field from within my case class. I've duplicated this with several other classes and have seen the same result (I use this example because it's very simple). Any idea if this is a known bug? Is there a workaround? Regards, Bryan Jeffrey

Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Bryan Jeffrey
SparkConf().set("spark.driver.allowMultipleContexts", "true").setAppName(appName).setMaster(master)
new StreamingContext(conf, Seconds(seconds))
}
Regards, Bryan Jeffrey On Wed, Nov 4, 2015 at 9:49 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Are you trying to sp
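The fragment above appears to come from a test helper; a self-contained sketch under the same assumptions (appName, master, and the batch duration supplied by the test):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createTestStreamingContext(appName: String, master: String, seconds: Long): StreamingContext = {
      val conf = new SparkConf()
        .set("spark.driver.allowMultipleContexts", "true")
        .setAppName(appName)
        .setMaster(master)
      new StreamingContext(conf, Seconds(seconds))
    }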

SparkSQL implicit conversion on insert

2015-11-02 Thread Bryan Jeffrey
conversion prior to insertion? Regards, Bryan Jeffrey

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-30 Thread Bryan Jeffrey
Deenar, This worked perfectly - I moved to SQL Server and things are working well. Regards, Bryan Jeffrey On Thu, Oct 29, 2015 at 8:14 AM, Deenar Toraskar <deenar.toras...@gmail.com> wrote: > Hi Bryan > > For your use case you don't need to have multiple metastores. The defa

Hive Version

2015-10-28 Thread Bryan Jeffrey
of the Spark documentation, but do not see version specified anywhere - it would be a good addition. Thank you, Bryan Jeffrey

Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
essible (and not partitioned). Is there a straightforward way to write to partitioned tables using Spark SQL? I understand that the read performance for partitioned data is far better - are there other performance improvements that might be better to use instead of partitioning? Regards, Bryan Jeffrey

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
issues (3) When partitioning without maps I see frequent out of memory issues. I'll update this email when I've got a more concrete example of problems. Regards, Bryan Jeffrey On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <suchenz...@gmail.com> wrote: > Have you tried partitionBy? >

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
every time. Is this a known issue? Is there a workaround? Regards, Bryan Jeffrey On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Susan, > > I did give that a shot -- I'm seeing a number of oddities: > > (1) 'Partition By' appears to only accept

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
MetadataTypedColumnsetSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
This seems like a pretty big bug associated with persistent tables. Am I missing a step somewhere? Thank you, Bryan Jeffrey On Wed, Oct 28, 2015 at 4:10

Spark SQL Persistent Table - joda DateTime Compatability

2015-10-27 Thread Bryan Jeffrey
me to a persistent Hive table accomplished? Has anyone else run into the same issue? Regards, Bryan Jeffrey

Re: Error Compiling Spark 1.4.1 w/ Scala 2.11 & Hive Support

2015-10-26 Thread Bryan Jeffrey
All, The error resolved to a bad version of jline being pulled from Maven. The jline version is defined as 'scala.version' -- the 2.11 version does not exist in Maven. Instead the following dependency should be used:
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>jline</artifactId>
  <version>2.11.0-M3</version>
</dependency>
Regards, Bryan Jeffrey

Error Compiling Spark 1.4.1 w/ Scala 2.11 & Hive Support

2015-10-26 Thread Bryan Jeffrey
All, I'm seeing the following error compiling Spark 1.4.1 w/ Scala 2.11 & Hive support. Any ideas? mvn -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive -Phive-thriftserver package [INFO] Spark Project Parent POM .. SUCCESS [4.124s] [INFO] Spark Launcher

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Bryan Jeffrey
t the method calls, the function that is called for appears to be the same. I was hoping an example might shed some light on the issue. Regards, Bryan Jeffrey On Thu, Oct 8, 2015 at 7:04 AM, Aniket Bhatnagar <aniket.bhatna...@gmail.com > wrote: > Here is an example: > > val

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Bryan Jeffrey
= initialRDD) > counts.print() > > Thanks, > Aniket > > > On Thu, Oct 8, 2015 at 5:48 PM Bryan Jeffrey <bryan.jeff...@gmail.com> > wrote: > >> Aniket, >> >> Thank you for the example - but that's not quite what I'm looking for. >
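A minimal sketch of the initial-state overload being discussed, assuming a DStream of (String, Int) pairs named wordCounts (an assumed name):

    import org.apache.spark.HashPartitioner

    val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 1), ("world", 1)))
    val updateFunc = (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0))

    // this overload also takes a partitioner alongside the initial RDD
    val counts = wordCounts.updateStateByKey(updateFunc, new HashPartitioner(2), initialRDD)
    counts.print()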

Re: Weird worker usage

2015-09-28 Thread Bryan Jeffrey
Nukunj, No, I'm not calling set w/ master at all. This ended up being a foolish configuration problem with my slaves file. Regards, Bryan Jeffrey On Fri, Sep 25, 2015 at 11:20 PM, N B <nb.nos...@gmail.com> wrote: > Bryan, > > By any chance, are you calling SparkConf.s

Re: Weird worker usage

2015-09-25 Thread Bryan Jeffrey
parkcheckpoint --broker kafkaBroker:9092 --topic test --numStreams 9 --threadParallelism 9 Even when I put a long-running job in the queue, none of the other nodes are anything but idle. Am I missing something obvious? Regards, Bryan Jeffrey On Fri, Sep 25, 2015 at 8:28 AM, Akhil Das <a

Re: Weird worker usage

2015-09-25 Thread Bryan Jeffrey
45 INFO SparkContext: Running Spark version 1.4.1
15/09/25 16:45:45 INFO SparkContext: Spark configuration:
spark.app.name=MainClass
spark.default.parallelism=6
spark.driver.supervise=true
spark.jars=file:/tmp/OinkSpark-1.0-SNAPSHOT-jar-with-dependencies.jar
spark.logConf=true
spark.master=local[*]
spark.rpc.askTimeout=10
spark.streaming.receiver.maxRate=500
As you can see, despite -Dmaster=spark://sparkserver:7077, the streaming context still registers the master as local[*]. Any idea why? Thank you, Bryan Jeffrey

Yarn Shutting Down Spark Processing

2015-09-22 Thread Bryan Jeffrey
need to change to allow Yarn to initialize Spark streaming (vs. batch) jobs? Thank you, Bryan Jeffrey

Suggested Method for Execution of Periodic Actions

2015-09-16 Thread Bryan Jeffrey
counts to a database. Is there a built in mechanism or established pattern to execute periodic jobs in spark streaming? Regards, Bryan Jeffrey

Problems with Local Checkpoints

2015-09-09 Thread Bryan Jeffrey
en something similar? Regards, Bryan Jeffrey

Java vs. Scala for Spark

2015-09-08 Thread Bryan Jeffrey
for Spark dev in an enterprise environment? What was the outcome? Regards, Bryan Jeffrey

Re: Java vs. Scala for Spark

2015-09-08 Thread Bryan Jeffrey
Thank you for the quick responses. It's useful to have some insight from folks already extensively using Spark. Regards, Bryan Jeffrey On Tue, Sep 8, 2015 at 10:28 AM, Sean Owen <so...@cloudera.com> wrote: > Why would Scala vs Java performance be different Ted? Relatively &

Getting Started with Spark

2015-09-08 Thread Bryan Jeffrey
Hello. We're getting started with Spark Streaming. We're working to build some unit/acceptance testing around functions that consume DStreams. The current method for creating DStreams is to populate the data by creating an InputDStream:
val input = Array(TestDataFactory.CreateEvent(123
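One hedged sketch of driving a DStream from in-memory test data, using queueStream rather than the InputDStream approach above (all names besides the Spark API are assumptions):

    import scala.collection.mutable
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val events = Seq("event-1", "event-2")  // placeholder test events
    val queue = mutable.Queue(ssc.sparkContext.parallelize(events))
    val stream = ssc.queueStream(queue)
    stream.print()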