Thanks for the advice. But since I am not the administrator of our Spark
cluster, I can't do this. Is there any better solution based on the current
Spark setup?
At 2014-08-01 02:38:15, shijiaxin shijiaxin...@gmail.com wrote:
Have you tried to write another similar function like edgeListFile in the
At 2014-08-01 11:23:49 +0800, Bin wubin_phi...@126.com wrote:
I am wondering what the best way to construct a graph is.
Say I have some attributes for each user, and a specific weight for each user
pair. The way I am currently doing it is to first read the user information and
edge triples into two
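For reference, a minimal sketch of one common way to assemble such a graph with GraphX, assuming one user file and one edge file; the field layout, paths, and the UserAttr class below are illustrative assumptions, not from the original message:
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical per-user attributes.
case class UserAttr(name: String, age: Int)

def buildGraph(sc: SparkContext): Graph[UserAttr, Double] = {
  // Vertices: (userId, attributes), parsed from a comma-separated user file.
  val users: RDD[(VertexId, UserAttr)] = sc.textFile("hdfs:///data/users.txt").map { line =>
    val f = line.split(",")
    (f(0).toLong, UserAttr(f(1), f(2).toInt))
  }
  // Edges: (srcUserId, dstUserId, weight), parsed from a comma-separated edge file.
  val edges: RDD[Edge[Double]] = sc.textFile("hdfs:///data/edges.txt").map { line =>
    val f = line.split(",")
    Edge(f(0).toLong, f(1).toLong, f(2).toDouble)
  }
  Graph(users, edges)
}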
Sometimes it is useful to convert an RDD into a DStream for testing purposes
(generating DStreams from historical data, etc.). Is there an easy way to do
this?
I could come up with the following inefficient way, but I'm not sure if there is
a better way to achieve this. Thoughts?
class
Is there a way to get an iterator from an RDD? Something like rdd.collect(), but
returning a lazy sequence and not a single array.
Context: I need to GZip processed data to upload it to Amazon S3. Since the
archive should be a single file, I want to iterate over the RDD, writing each
line to a local .gz file. File
I used the web UI of Spark and could see that the conf directory is in the CLASSPATH.
An abnormal thing is that when starting spark-shell I always get the following
info:
WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
At first, I
When I use fewer partitions (like 6),
it seems that all the tasks will be assigned to the same machine, because that
machine has more than 6 cores. But this runs out of memory.
How can I set a smaller number of partitions and still use all the machines at the same time?
Attempting to build Spark from source on EC2 using sbt gives the error
sbt.ResolveException: unresolved dependency:
org.scala-lang#scala-library;2.10.2: not found. This only seems to happen on
EC2, not on my local machine.
To reproduce, launch a cluster using spark-ec2, clone the Spark
See https://issues.apache.org/jira/browse/SPARK-2579
It was also mentioned on the mailing list a while ago, and I have heard
tell of this from customers. I am trying to get to the bottom of it
too.
What version are you using, to start? I am wondering if it was fixed
in 1.0.x since I was not able
Hi
While trying to build Spark 0.9.2 using sbt, the build is failing because most
of the libraries cannot be resolved; sbt cannot fetch the libraries from
the specified location.
Please tell me what changes are required to build Spark using sbt.
Regards
Arun
Hi Simon,
I'm trying to do the same but I'm quite lost.
How did you do that? (Too direct? :)
Thanks and ciao,
r-
It looked like you were running in standalone mode (master set to
local[4]). That's how I ran it.
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
Hi,
We would like to use Spark SQL to store data in Parquet format and then
query that data using Impala.
We've tried to come up with a solution and it is working but it doesn't
seem good. So I was wondering if you guys could tell us what is the
correct way to do this. We are using Spark 1.0
Sorry, sent early, wasn't finished typing.
CREATE EXTERNAL TABLE
Then we can select the data using Impala. But this is registered as an
external table and must be refreshed if new data is inserted.
Obviously this doesn't seem good and doesn't seem like the correct solution.
How should we
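For what it's worth, a minimal sketch of the write side with Spark 1.0's SchemaRDD API; the Event class and all paths are placeholders, and Impala would still need an external table pointing at the output directory:
import org.apache.spark.sql.SQLContext

case class Event(id: Int, value: String)   // placeholder schema

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD          // implicit RDD[case class] -> SchemaRDD

val events = sc.textFile("hdfs:///raw/events").map { line =>
  val f = line.split(",")
  Event(f(0).toInt, f(1))
}
// Writes Parquet files that Impala can map with something like:
//   CREATE EXTERNAL TABLE events (id INT, value STRING)
//   STORED AS PARQUET LOCATION '/warehouse/events';
events.saveAsParquetFile("hdfs:///warehouse/events")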
TD,
We are seeing the same issue. We struggled through this until we found this
post and the workaround.
A quick fix in the Spark Streaming software will help a lot for others who
are encountering this and pulling their hair out over why RDDs on some
partitions are not computed (we ended up
Here's a piece of code. In your case, you are missing the call() method
inside the map function.
import java.util.Iterator;
import java.util.List;
import org.apache.commons.configuration.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import
Hi All,
My application works when I use spark-submit with master=local[*].
But if I deploy the application to a standalone cluster with
master=spark://master:7077, the application doesn't work and I get the
following exception:
14/08/01 05:18:51 ERROR TaskSchedulerImpl: Lost executor 0 on
Hi TD,
I've also been fighting this issue only to find the exact same solution you
are suggesting.
Too bad I didn't find either the post or the issue sooner.
I'm using a 1-second batch with N Kafka events (1-to-1 with the
state objects) per batch and only calling updateStateByKey
[Forking this thread.]
According to the Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence,
persisting RDDs with MEMORY_ONLY should not choke if the RDD cannot be held
entirely in memory:
If the RDD does not fit in memory, some partitions will not
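For reference, a minimal sketch of the storage levels being discussed; the RDD here is just placeholder data:
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)          // placeholder data
rdd.persist(StorageLevel.MEMORY_ONLY)           // partitions that don't fit are recomputed when needed
// rdd.persist(StorageLevel.MEMORY_AND_DISK)    // alternative: spill partitions that don't fit to disk
rdd.count()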
Hi, thanks all. I have a few more questions on this.
Suppose I don't want to pass a WHERE clause in my SQL; is there a way that
I can do this?
Right now I am trying to modify the JdbcRDD class by removing all the parameters
for lower bound and upper bound, but I am getting runtime exceptions.
Is
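For context, a sketch of the stock JdbcRDD usage: the constructor expects a query with exactly two '?' placeholders that it fills with the lower and upper bounds, which is why removing those parameters breaks it. The connection string, table, and columns below are hypothetical:
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// Assumes the JDBC driver jar is on the classpath.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass"),
  "SELECT id, name FROM users WHERE id >= ? AND id <= ?",  // the two ?s are required
  1L, 1000000L,                                            // lowerBound, upperBound
  10,                                                      // numPartitions
  rs => (rs.getInt("id"), rs.getString("name")))
println(rows.count())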
Isn't this your worker running out of its memory for computations,
rather than for caching RDDs? So it has enough memory when you don't
actually use a lot of the heap for caching, but when the cache uses
its share, you actually run out of memory. If I'm right, and even I am
not sure I have this
Hi,
I am trying to understand the query plan and the number of tasks / execution time
for a join query.
Consider the following example, creating two tables, emp and sal, with 100 appropriate
records in each table and a key for joining them.
EmpRDDRelation.scala
case class EmpRecord(key:
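Since the original EmpRDDRelation.scala is truncated here, the following is a self-contained sketch of the kind of setup being described, using Spark SQL 1.0 case-class tables; the record fields are assumptions:
import org.apache.spark.sql.SQLContext

case class EmpRecord(key: Int, name: String)   // assumed fields
case class SalRecord(key: Int, salary: Int)    // assumed fields

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val emp = sc.parallelize(1 to 100).map(i => EmpRecord(i, "emp" + i))
val sal = sc.parallelize(1 to 100).map(i => SalRecord(i, i * 1000))
emp.registerAsTable("emp")
sal.registerAsTable("sal")

// The join whose plan and task count are being investigated.
val joined = sqlContext.sql(
  "SELECT emp.key, emp.name, sal.salary FROM emp JOIN sal ON emp.key = sal.key")
joined.collect().foreach(println)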
On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen so...@cloudera.com wrote:
Isn't this your worker running out of its memory for computations,
rather than for caching RDDs?
I’m not sure how to interpret the stack trace, but let’s say that’s true.
I’m even seeing this with a simple a =
Hi everyone
I haven't been receiving replies to my queries on the distribution list.
Not pissed, but I am actually curious to know if my messages are actually
going through or not. Can someone please confirm that my messages are getting
delivered via this distribution list?
Thanks,
Aniket
On 1
I am using 1.0.1. It does not matter to me whether it is the first or second
element. I would like to know how to extract the i-th element in the feature
vector (not the label).
data.features(i) gives the following error:
method apply in trait Vector cannot be accessed in
So is the only issue that Impala does not see changes until you refresh the
table? This sounds like a configuration that needs to be changed on the
Impala side.
On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin mcgloin.patr...@gmail.com
wrote:
Sorry, sent early, wasn't finished typing.
CREATE
Hi Roberto,
Ultimately, the info you need is set here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69
Being a Spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as
HadoopRDDWithEnv, which takes in an additional parameter
Oh I'm sorry, I somehow misread your email as looking for the label. I
read too fast. That was pretty silly. This works for me though:
scala> val point = LabeledPoint(1, Vectors.dense(2, 3, 4))
point: org.apache.spark.mllib.regression.LabeledPoint = (1.0,[2.0,3.0,4.0])
scala> point.features(1)
res10: Double = 3.0
rdd.toLocalIterator will do almost what you want, but requires that each
individual partition fits in memory (rather than each individual line).
Hopefully that's sufficient, though.
On Fri, Aug 1, 2014 at 1:38 AM, Andrei faithlessfri...@gmail.com wrote:
Is there a way to get iterator from RDD?
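A small sketch of the gzip use case with toLocalIterator, under the per-partition memory caveat above; the data and output path are placeholders:
import java.io.{FileOutputStream, OutputStreamWriter}
import java.util.zip.GZIPOutputStream

val lines = sc.parallelize(Seq("line1", "line2", "line3"))   // placeholder data
val out = new OutputStreamWriter(
  new GZIPOutputStream(new FileOutputStream("/tmp/output.gz")))
try {
  // toLocalIterator pulls one partition at a time to the driver.
  lines.toLocalIterator.foreach { line =>
    out.write(line)
    out.write("\n")
  }
} finally {
  out.close()
}
// The single /tmp/output.gz file can then be uploaded to S3.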
We are using Sparks 1.0 for Spark Streaming on Spark Standalone cluster and
seeing the following error.
Job aborted due to stage failure: Task 3475.0:15 failed 4 times, most
recent failure: Exception failure in TID 216394 on host
hslave33102.sjc9.service-now.com: java.lang.Exception:
Hi all,
I have a scenario of a web application submitting multiple jobs to Spark.
These jobs may be operating on the same RDD.
Is it possible to cache() the RDD during one call
so that all subsequent calls can use the cached RDD?
basically, during one invocation
val rdd1 =
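A sketch of the pattern being asked about, assuming every web request goes through one long-lived SparkContext; the holder object and path are hypothetical:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SharedRdds {
  @volatile private var logs: RDD[String] = _

  // The first caller materializes and caches the RDD; later callers on the same
  // SparkContext reuse the cached blocks.
  def getLogs(sc: SparkContext): RDD[String] = synchronized {
    if (logs == null) {
      logs = sc.textFile("hdfs:///data/logs").cache()
    }
    logs
  }
}

// Request 1: triggers the first computation and caches the result.
// SharedRdds.getLogs(sc).count()
// Request 2: reuses the cached RDD.
// SharedRdds.getLogs(sc).filter(_.contains("ERROR")).count()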
I think this is the problem. I was working in a project that inherited some
other Akka dependencies (of a different version). I'm switching to a fresh new
project which should solve the problem.
Thanks,
Alex
From: Tathagata Das
I also ran into same issue. What is the solution?
Any pointers to this issue?
Thanks
I have what seems like a relatively straightforward task to accomplish, but I
cannot seem to figure it out from the Spark documentation or searching the
mailing list.
I have an RDD[(String, MyClass)] that I would like to group by the key, and
calculate the mean and standard deviation of the foo
Are you accessing the RDDs on raw data blocks and running independent
Spark jobs on them (that is, outside the DStream)? In that case this may
happen, as Spark Streaming will clean up the raw data based on the
DStream operations (if there is a window op of 15 mins, it will keep
the data around for 15
Hi,
I upgraded to 1.0.1 from 1.0 a couple of weeks ago and have been able to use
some of the features advertised in 1.0.1. However, I get some compilation
errors in some cases and based on user response, these errors have been
addressed in the 1.0.1 version and so I should not be getting these
This should be okay, but make sure that your cluster also has the right code
deployed. Maybe you have the wrong one.
If you built Spark from source multiple times, you may also want to try sbt
clean before sbt assembly.
Matei
On August 1, 2014 at 12:00:07 PM, SK (skrishna...@gmail.com) wrote:
The reason I want an RDD is because I'm assuming that iterating the
individual elements of an RDD on the driver of the cluster is much slower
than coming up with the mean and standard deviation using a
map-reduce-based algorithm.
I don't know the intimate details of Spark's implementation, but it
You're certainly not iterating on the driver. The Iterable you process
in your function is on the cluster and done in parallel.
On Fri, Aug 1, 2014 at 8:36 PM, Kristopher Kalish k...@kalish.net wrote:
The reason I want an RDD is because I'm assuming that iterating the
individual elements of an
Hi Akhil,
Thank you very much for your help and support.
Regards,
Rajesh
On Fri, Aug 1, 2014 at 7:57 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Here's a piece of code. In your case, you are missing the call() method
inside the map function.
import java.util.Iterator;
import
It is definitely possible to run multiple workers on a single node and have
each worker with the maximum number of cores (e.g. if you have 8 cores and
2 workers you'd have 16 cores per node). I don't know if it's possible with
the out of the box scripts though.
It's actually not really that
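For reference, a sketch of the standalone-mode settings usually used for this, set in conf/spark-env.sh on each worker node; the exact numbers are only illustrative:
# conf/spark-env.sh (standalone mode)
export SPARK_WORKER_INSTANCES=2   # run two worker daemons on this node
export SPARK_WORKER_CORES=8       # cores each worker advertises
export SPARK_WORKER_MEMORY=16g    # memory each worker can hand out to executors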
Computing the variance is similar to this example; you just need to keep
the sum of squares around as well.
The formula for variance is (sumsq/n) - (sum/n)^2.
But with big datasets or large values, you can quickly run into overflow
issues - MLlib handles this by maintaining the average sum of
Here's the more functional programming-friendly take on the
computation (but yeah, this is the naive formula):
rdd.groupByKey.mapValues { mcs =>
  val values = mcs.map(_.foo.toDouble)
  val n = values.size
  val sum = values.sum
  val sumSquares = values.map(x => x * x).sum
  math.sqrt(n * sumSquares - sum * sum) / n
}
We are using Spark 1.0.
I'm using DStream operations such as map, filter and reduceByKeyAndWindow
and doing a foreach operation on DStream.
Ignoring my warning about overflow - even more functional - just use a
reduceByKey.
Since your main operation is just a bunch of summing, you've got a
commutative-associative reduce operation and Spark will do everything
cluster-parallel, and then shuffle the (small) result set and merge
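A sketch of that reduceByKey formulation, still using the naive formula; MyClass and its foo field stand in for the types from the original question:
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

case class MyClass(foo: Double)          // stand-in for the poster's class

val rdd = sc.parallelize(Seq("a" -> MyClass(1.0), "a" -> MyClass(3.0), "b" -> MyClass(2.0)))

// One pass to (count, sum, sumOfSquares) per key, then derive mean and stddev.
val stats = rdd
  .mapValues(mc => (1L, mc.foo, mc.foo * mc.foo))
  .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) }
  .mapValues { case (n, sum, sumSq) =>
    val mean = sum / n
    val stddev = math.sqrt(sumSq / n - mean * mean)
    (mean, stddev)
  }
stats.collect().foreach(println)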
Hi,
I've seen many threads about reading from HBase into Spark, but none about
how to read from OpenTSDB into Spark. Does anyone know anything about this?
I tried looking into it, but I think OpenTSDB saves its information into
HBase using hex and I'm not sure how to interpret the data. If you
Hi,
So I again ran sbt clean followed by all of the steps listed above to
rebuild the jars after cleaning. My compilation error still persists.
Specifically, I am trying to extract an element from the feature vector that
is part of a LabeledPoint as follows:
data.features(i)
This gives the
Thanks, Aaron, it should be fine with partitions (I can repartition it
anyway, right?).
But rdd.toLocalIterator is purely a Java/Scala method. Is there a Python
interface to it?
I can get a Java iterator through rdd._jrdd, but it isn't converted to a Python
iterator automatically. E.g.:
rdd =
Hey,
There is some work that started on IndexedRDD (on master I think).
Meanwhile, checking what has been done in GraphX regarding the vertex index in
partitions could be worthwhile, I guess.
HTH
Andy
On 1 August 2014 at 22:50, Philip Ogren philip.og...@oracle.com wrote:
Suppose I want to take my large
Me 3
On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:
I also ran into same issue. What is the solution?
Currently Scala 2.10.2 can't be pulled in from Maven Central, it seems;
however, if you have it in your Ivy cache it should work.
On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:
Me 3
On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:
I also ran into
Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small
avro files into one avro file. I read it in with:
sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](path)
but I can't find saveAsHadoopFile or saveAsNewAPIHadoopFile. Can you
At 2014-08-01 14:50:22 -0600, Philip Ogren philip.og...@oracle.com wrote:
It seems that I could do this with mapPartition so that each element in a
partition gets added to an index for that partition.
[...]
Would it then be possible to take a string and query each partition's index
with it?
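A rough sketch of the per-partition index idea, using a plain in-memory Set of tokens as a stand-in for a real index; a Lucene-style index per partition would follow the same shape:
val docs = sc.parallelize(Seq("the quick brown fox", "lazy dog"), 2)   // placeholder data

// Build one "index" per partition (here just a token set) and keep it cached.
val partitionIndexes = docs.mapPartitions { iter =>
  val index: Set[String] = iter.flatMap(_.split("\\s+")).toSet
  Iterator(index)
}.cache()

// Query every partition's index with a string; one answer per partition.
def query(term: String): Array[Boolean] =
  partitionIndexes.map(_.contains(term)).collect()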
Have you tried
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
There is also a 0.9.1 version they talked about in one of the meetups.
Check out the S3 bucket in the guide; it should have a 0.9.1 version as
well.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
This is a Scala bug - I filed something upstream; hopefully they can fix it
soon and/or we can provide a workaround:
https://issues.scala-lang.org/browse/SI-8772
- Patrick
On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:
Currently scala 2.10.2 can't be pulled in from
This fails for me too. I have no idea why it happens as I can wget the pom
from maven central. To work around this I just copied the ivy xmls and jars
from this github repo
https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library
and put it in
Thanks Patrick -- It does look like some maven misconfiguration as
wget
http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom
works for me.
Shivaram
On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell pwend...@gmail.com wrote:
This is a Scala bug - I filed
I've had intermittent access to the artifacts themselves, but for me the
directory listing always 404's.
I think if sbt hits a 404 on the directory, it sends a somewhat confusing
error message that it can't download the artifact.
- Patrick
On Fri, Aug 1, 2014 at 3:28 PM, Shivaram Venkataraman
What is the use case you are looking at?
TSDB is not designed for you to query data directly from HBase; ideally you
should use the REST API if you are looking to do thin analysis. Are you looking
to do a whole reprocessing of TSDB?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
Nice question :)
Ideally you should use the queueStream interface to push RDDs into a queue;
then Spark Streaming can handle the rest.
Though, why are you looking to convert an RDD to a DStream? Another workaround
folks use is to source the DStream from folders and move files that they need
reprocessed back into
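A minimal sketch of the queueStream approach mentioned above, assuming the historical data has already been split into one RDD per batch interval:
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))

// One RDD per batch interval of historical data (placeholder contents).
val rddQueue = mutable.Queue[RDD[String]]()
rddQueue ++= Seq(
  sc.parallelize(Seq("event1", "event2")),
  sc.parallelize(Seq("event3")))

// Each interval, Spark Streaming dequeues one RDD and treats it as that batch's data.
val historicalStream = ssc.queueStream(rddQueue)
historicalStream.map(_.toUpperCase).print()

ssc.start()
ssc.awaitTermination()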
The only blocker is that an accumulator can only be added to from the slaves and
only read on the master. If that constraint fits you well, you can fire away.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Aug 1, 2014 at 7:38 AM, Julien
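For reference, a small sketch of the accumulator constraint being described, with a placeholder input path:
// Tasks on the workers may only add to the accumulator...
val errorCount = sc.accumulator(0)
sc.textFile("hdfs:///data/logs").foreach { line =>
  if (line.contains("ERROR")) errorCount += 1
}
// ...while only the driver may read its value.
println(errorCount.value)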
All the operations being done use the DStream. I do read an RDD into
memory which is collected and converted into a map and used for lookups as
part of DStream operations. This RDD is loaded only once and converted into a
map that is then used on the streamed data.
Do you mean non-streaming jobs on
I meant: are you using RDDs generated by DStreams in Spark jobs outside
the DStream computation?
Something like this:
var globalRDD: RDD[String] = null   // element type is just a placeholder
dstream.foreachRDD { rdd =>
  // keep a global pointer to the RDDs generated by the DStream
  if (runningFirstTime) globalRDD = rdd
}
ssc.start()
...
Not at all. Don't have any such code.
I'm trying to get metrics out of TSDB so I can use Spark to do anomaly
detection on graphs.
The HTTP API would be the best bet. I assume by graph you mean the charts
created by TSDB frontends.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Aug 1, 2014 at 4:48 PM, bumble123 tc1...@att.com wrote:
I'm trying to
So is there no way to do this through Spark Streaming? Won't I have to do
batch processing if I use the HTTP API rather than getting it directly into
Spark?
Ah, that's unfortunate, that definitely should be added. Using a
pyspark-internal method, you could try something like
javaIterator = rdd._jrdd.toLocalIterator()
it = rdd._collect_iterator_through_file(javaIterator)
On Fri, Aug 1, 2014 at 3:04 PM, Andrei faithlessfri...@gmail.com wrote:
Then could you try giving me a log?
And as a workaround, disable automatic unpersisting by setting spark.streaming.unpersist = false
On Fri, Aug 1, 2014 at 4:10 PM, Kanwaldeep kanwal...@gmail.com wrote:
Not at all. Don't have any such code.
You have to import org.apache.spark.rdd._, which will automatically make
this method available.
Thanks,
Ron
Sent from my iPhone
On Aug 1, 2014, at 3:26 PM, touchdown yut...@gmail.com wrote:
Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small
avro files into one
Can you share the mapValues approach you did?
Thanks,
Ron
Sent from my iPhone
On Aug 1, 2014, at 3:00 PM, kriskalish k...@kalish.net wrote:
Thanks for the help everyone. I got the mapValues approach working. I will
experiment with the reduceByKey approach later.
3
-Kris
Here is the log file.
streaming.gz
http://apache-spark-user-list.1001560.n3.nabble.com/file/n11240/streaming.gz
There are quite a few AskTimeouts that happen for about 2 minutes,
followed by block-not-found errors.
Thanks
Kanwal
Yes, I saw that after I looked at it closer. Thanks! But I am running into a
schema not set error:
Writer schema for output key was not set. Use AvroJob.setOutputKeySchema()
I am in the process of figuring out how to set the schema for an AvroJob from an
HDFS file, but any pointer is much appreciated!
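A rough sketch of wiring the schema in and writing with the new Hadoop API, assuming avro-mapred is on the classpath and the schema file is readable from the driver; all paths are placeholders:
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

// Read the small Avro files, as in the earlier message.
val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("hdfs:///input/small-avro")

// Parse the writer schema (Schema.Parser can also read an InputStream, e.g. from HDFS).
val schema = new Schema.Parser().parse(new java.io.File("/local/path/record.avsc"))

val job = Job.getInstance(sc.hadoopConfiguration)   // Hadoop 2 API; on Hadoop 1 use new Job(conf)
AvroJob.setOutputKeySchema(job, schema)

records.saveAsNewAPIHadoopFile(
  "hdfs:///output/merged-avro",
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[GenericRecord]],
  job.getConfiguration)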