>>> in https://issues.apache.org/jira/browse/SPARK-3655 (not yet available) and
>>> https://github.com/tresata/spark-sorted (not part of spark, but it is
>>> available right now). Hopefully that's what you are looking for. To the
>>> best of my knowledge that
>>>
> ...transformations to compute the required results in one single operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney wrote:
Threads
On Friday, January 15, 2016, Kira wrote:
> Hi,
>
> Can we run *simultaneous* actions on the *same RDD*? If yes, how can this
> be done?
>
> Thank you,
> Regards
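In practice "threads" means launching each action from its own thread (for example with Scala Futures) against a cached RDD; the scheduler happily runs the resulting jobs concurrently. A minimal sketch, assuming a local SparkContext purely for illustration:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

import org.apache.spark.{SparkConf, SparkContext}

object SimultaneousActions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("simultaneous-actions").setMaster("local[4]"))

    // Cache so both jobs reuse the same data instead of recomputing it.
    val rdd = sc.parallelize(1 to 1000000).cache()

    // Each action is submitted from its own thread; Spark runs the jobs concurrently.
    val total = Future { rdd.sum() }
    val evens = Future { rdd.filter(_ % 2 == 0).count() }

    println(Await.result(total, 10.minutes))
    println(Await.result(evens, 10.minutes))
    sc.stop()
  }
}

Spark also ships async variants of some actions (countAsync, collectAsync, etc. on AsyncRDDActions) that return a FutureAction, which amounts to the same thing.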
reading the code, is there any reason why
setting spark.cleaner.ttl.MAP_OUTPUT_TRACKER directly won't get picked up?
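For reference, a sketch of what I mean by setting it directly, alongside the usual spark.cleaner.ttl (whether the per-cleaner key is honored is exactly the question):

import org.apache.spark.{SparkConf, SparkContext}

// Global metadata TTL in seconds, plus the tracker-specific override under discussion.
val conf = new SparkConf()
  .setAppName("ttl-example")
  .set("spark.cleaner.ttl", "3600")
  .set("spark.cleaner.ttl.MAP_OUTPUT_TRACKER", "600")

val sc = new SparkContext(conf)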
2015-11-17 14:45 GMT-05:00 Jonathan Coveney :
so I have the following...
broadcast some stuff
cache an rdd
do a bunch of stuff, eventually calling actions which reduce it to an
acceptable size
I'm getting an OOM on the driver (well, GC is getting out of control),
largely because I have a lot of partitions and it looks like the job
history is
unclear as to why it works with 2.11.7 and not 2.11.6.
>
> Thanks,
> Babar
>
> On Mon, Nov 2, 2015 at 2:10 PM Jonathan Coveney wrote:
Caused by: java.lang.ClassNotFoundException: scala.Some

indicates that you don't have the Scala libs present. How are you executing
this? My guess is that the issue is a conflict between Scala 2.11.6 in your
build and 2.11.7? Not sure... try setting your Scala version to 2.11.7.
But really, first it'd be good
Additionally, I'm curious whether there are any JIRAs around making DataFrames
support ordering better. There are a lot of operations that can be optimized
if you know that you have a total ordering on your data... are there any
plans, or at least JIRAs, around having the Catalyst optimizer handle this
case?
Do you have JAVA_HOME set to a Java 7 JDK?
2015-10-23 7:12 GMT-04:00 emlyn :
> xjlin0 wrote
> > I cannot enter the REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1 (whether pre-built
> > with or without Hadoop, or home compiled with ant or maven). There was no
> > error message in v1.4.x; the system prompts nothing.
I've noticed this as well and am curious if there is anything more people
can say.
My theory is that it is just communication overhead. If you only have a
couple of gigabytes (a tiny dataset), then splitting that across 50 nodes
means you'll have a ton of tiny partitions all finishing very quickly, and
the scheduling and communication overhead starts to dominate the actual work.
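If that is what is going on, one cheap experiment is to collapse the data into far fewer partitions so each task does meaningful work. A sketch in the shell (the input path and partition count are made up):

val lines = sc.textFile("hdfs:///data/tiny-dataset")   // hypothetical path

// coalesce() reduces the partition count without a full shuffle, so a small
// dataset isn't scattered into hundreds of near-empty tasks.
val compact = lines.coalesce(8)
println(compact.count())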
LZO files are not splittable by default, but there are projects with input
and output formats to make splittable LZO files. Check out Twitter's
elephant-bird on GitHub.
On Wednesday, October 7, 2015, Mohammed Guller wrote:
> It is not uncommon to process datasets larger than available memory
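As a rough sketch of what reading those looks like (the LzoTextInputFormat class name comes from elephant-bird; the path is hypothetical, and you still need the hadoop-lzo codec plus per-file .index files for the splits to actually kick in):

import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Read LZO-compressed text through elephant-bird's splittable input format.
val lines = sc.newAPIHadoopFile(
    "hdfs:///data/logs",                 // hypothetical path
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, text) => text.toString }

println(lines.count())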
Nobody is saying not to use immutable data structures, only that Guava's
aren't natively supported.
Scala's default collections library is immutable: List, Vector, Map.
This is what people generally use, especially in Scala code!
On Tuesday, October 6, 2015, Jakub Dubovsky <
spark.dubovsk
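To illustrate the point (a tiny sketch, nothing Spark-specific about it): the default collections are persistent, so "modifying" one just gives you a new value, which also makes them safe to capture in Spark closures.

// Scala's default List and Map are immutable; "adding" returns a new collection.
val xs = List(1, 2, 3)
val ys = 0 :: xs                  // new list; xs is untouched
val m  = Map("a" -> 1)
val m2 = m + ("b" -> 2)           // new map; m is untouched

// They can be captured directly in RDD transformations.
val rdd = sc.parallelize(ys).map(n => n + m2.getOrElse("b", 0))
println(rdd.collect().mkString(","))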
You can put a class in the org.apache.spark namespace to access anything
that is private[spark]. You can then make enrichments there to access
whatever you need. Just beware upgrade pain :)
On Tuesday, October 6, 2015, Erwan ALLAIN wrote:
> Hello,
>
> I'm currently testing spark streaming
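A minimal sketch of the namespace trick; the member reached here (SparkContext.env, which is private[spark] in the versions I've looked at) is just an example, so substitute whatever internal you actually need:

package org.apache.spark

// Anything declared under org.apache.spark can see members marked private[spark].
// This compiles against internals, so expect breakage on upgrades.
object SparkInternals {
  // Example enrichment: peek at the (private[spark]) SparkEnv behind a context.
  def executorMemorySetting(sc: SparkContext): Option[String] =
    sc.env.conf.getOption("spark.executor.memory")
}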
It's entirely conceivable to beat Spark's performance on tiny data sets like
this. That's not really what it has been optimized for.
On Tuesday, September 22, 2015, juljoin wrote:
> Hello,
>
> I am trying to figure Spark out and I still have some problems with its
> speed, I ca
Having a file per record is pretty inefficient on almost any file system.
On Tuesday, September 22, 2015, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> We are trying to load around 10k avro files (each file holds only one
> record) using spark-avro but it takes over 15 min
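One common mitigation is to read the whole directory in a single pass and immediately collapse it into a sane number of partitions. A sketch using the spark-avro reader of that era (the com.databricks.spark.avro format name and the path are assumptions; check against your version):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Read the directory of tiny Avro files in one go, then repartition so
// downstream stages aren't dominated by per-file scheduling overhead.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("hdfs:///data/events")           // hypothetical path
  .repartition(16)

println(df.count())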
It worked for Twitter!
Seriously though: Scala is much, much more pleasant. And Scala has a great
story for using Java libs. And since Spark is kind of framework-y (use its
scripts to submit, start up the REPL, etc.), the projects tend to be leaf
projects, so even in a big company that uses Java the cost
> scalaVersion := "2.11.6"
>
> libraryDependencies ++= Seq(
>   "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
>   "org.apache.spark" %% "spark-sql" % "1.4.1" % "provided",
>   ...
Try adding the following to your build.sbt
libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.6"
I believe that spark shades the scala library, and this is a library
that it looks like you need in an unshaded way.
2015-09-07 16:48 GMT-04:00 Gheorghe Postelnicu <
gheorghe.posteln
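Putting that together, the build.sbt would look roughly like this (a sketch assuming the Spark 1.4.1 / Scala 2.11.6 combination from the thread):

scalaVersion := "2.11.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"    % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"     % "1.4.1" % "provided",
  // scala-reflect appears to be needed unshaded at runtime, hence the explicit dependency.
  "org.scala-lang"   %  "scala-reflect" % "2.11.6"
)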
You can make a Hadoop input format which passes through the name of the
file. I generally find it easier to just hit the Hadoop FileSystem API, get
the file names, and construct the RDDs, though.
On Tuesday, September 1, 2015, Matt K wrote:
> Just want to add - I'm looking to partition the resulting Parquet
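A sketch of the second approach: list the files from HDFS, build one RDD per file tagged with its name, and union them (paths are hypothetical; sc.wholeTextFiles is another option when the files are small):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List the input files directly from the file system.
val fs = FileSystem.get(new Configuration())
val files = fs.listStatus(new Path("hdfs:///data/input"))   // hypothetical dir
  .filter(_.isFile)
  .map(_.getPath.toString)

// One RDD per file, each record tagged with the file it came from.
val perFile = files.map(f => sc.textFile(f).map(line => (f, line)))
val all = sc.union(perFile.toSeq)
println(all.count())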
Array[String] doesn't pretty print by default. Use .mkString(",") on each
array, for example.
On Thursday, August 27, 2015, Arun Luthra wrote:
> What types of RDD can saveAsObjectFile(path) handle? I tried a naive test
> with an RDD[Array[String]], but when I tried to read back the result with
> sc.objectFile
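A tiny sketch of the round trip and of why the printed output looks wrong (the path is made up):

val rdd = sc.parallelize(Seq(Array("a", "b"), Array("c", "d")))

// saveAsObjectFile works for any RDD whose elements are Java-serializable.
rdd.saveAsObjectFile("hdfs:///tmp/arrays")                    // hypothetical path

val back = sc.objectFile[Array[String]]("hdfs:///tmp/arrays")

// Arrays print as something like [Ljava.lang.String;@1b6d3586 --
// use mkString to see the contents.
back.collect().foreach(arr => println(arr.mkString(",")))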
I've used the instructions and it worked fine.
Can you post exactly what you're doing, and what it fails with? Or are you
just trying to understand how it works?
2015-08-24 15:48 GMT-04:00 Lanny Ripple :
> Hello,
>
> The instructions for building spark against scala-2.11 indicate using
> -Dspark
Put a log4j.properties file in conf/. You can copy
log4j.properties.template as a good base.
On Wednesday, July 29, 2015, canan chen wrote:
> Anyone know how to set log level in spark-submit ? Thanks
>
That's great! Thanks
On Tuesday, July 28, 2015, Ted Yu wrote:
> If I understand correctly, there would be one value in the executor.
>
> Cheers
>
> On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney wrote:
I am running in coarse-grained mode, let's say with 8 cores per executor.
If I use a broadcast variable, will all of the tasks in that executor share
the same value? Or will each task broadcast its own value? I.e., in this
case, would there be one value in the executor shared by the 8 tasks, or
would there be 8 copies?
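For reference, a minimal sketch of the pattern; the broadcast value is deserialized once per executor JVM and shared read-only by every task running there:

// A lookup table we want on every executor without shipping it with each task.
val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
val bc = sc.broadcast(lookup)

val data = sc.parallelize(Seq("a", "b", "c", "d"))

// Tasks read bc.value; within one executor they all see the same deserialized object.
val scored = data.map(k => (k, bc.value.getOrElse(k, 0)))
println(scored.collect().mkString(", "))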
Spark version is 1.3.0 (will upgrade as soon as we upgrade past mesos
0.19.0)...
Regardless, I'm running into a really weird situation where when I pass
--jars to bin/spark-shell I can't reference those classes on the repl. Is
this expected? The logs even tell me that my jars have been added, and
What about .groupBy doesn't work for you?
2015-05-07 8:17 GMT-04:00 Night Wolf :
> MyClass is a basic scala case class (using Spark 1.3.1);
>
> case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int, ipi: Double) {
>   override def hashCode(): Int = crn.hashCode()
> }
>
>
> On Wed
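For concreteness, a sketch of grouping such records (the Result class as quoted above; the sample values are made up):

case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int, ipi: Double) {
  override def hashCode(): Int = crn.hashCode()
}

val results = sc.parallelize(Seq(
  Result(1L, 10, 1, 100, 0.5),
  Result(1L, 11, 2, 101, 0.7),
  Result(2L, 12, 1, 102, 0.9)))

// Group all results that share the same crn.
val byCrn = results.groupBy(_.crn)
byCrn.collect().foreach { case (crn, rs) => println(s"$crn -> ${rs.size} rows") }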
A helpful example of how to convert:
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
As far as performance, that depends on your data. If you have a lot of
columns and use all of them, Parquet deserialization is expensive. If you
have a lot of columns and only need a few, the columnar layout lets you read
just those and skip the rest.
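A quick sketch of that column-pruning upside with the DataFrame reader (Spark 1.4+ syntax; the path and column names are made up):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.parquet("hdfs:///warehouse/events")   // hypothetical path

// Only the selected columns are read from the Parquet row groups;
// the remaining columns are skipped entirely.
val slim = df.select("user_id", "event_time")
println(slim.count())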
Can you check your local and remote logs?
2015-05-06 16:24 GMT-04:00 Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com>:
> This problem happens in Spark 1.3.1. It happens when two jobs are running
> simultaneously, each in its own SparkContext.
>
>
>
> I don’t remember seeing this bug in Spar
> Any suggestion from your experience on how to organize data in a splittable
> file format on HDFS for Spark?
>
> Rendy
> On May 6, 2015 1:03 AM, "Jonathan Coveney" wrote:
"As per my understanding, storing 5minutes file means we could not create
RDD more granular than 5minutes."
This depends on the file format. Many file formats are splittable (like
parquet), meaning that you can seek into various points of the file.
2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior
not fit. I'd be interested to hear more about a workload where that's
> relevant though, before going that route. Maybe if people are using
> SSD's that would make sense.
>
> - Patrick
>
> On Mon, Apr 13, 2015 at 8:19 AM, Jonathan Coveney wrote:
> > I'
in either RDD that has itemID = 0 or null?
> And what is a catch-all?
>
> Does that imply it is a good idea to run a filter on each RDD first? We do
> not do this using Pig on M/R. Is it required in the Spark world?
>
> On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney wrote:
My guess would be data skew. Do you know if there is some item id that is a
catch-all? Can it be null? Item id 0? Lots of data sets have this sort of
value, and it always kills joins.
2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :
> Code:
>
> val viEventsWithListings: RDD[(Long, (DetailInputRecord, VIS
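A sketch of the usual first check and mitigation: count the hottest keys, then filter out (or handle separately) the catch-all values before joining. The key values here are made up.

// Hypothetical keyed RDDs about to be joined on item id.
val left  = sc.parallelize(Seq((0L, "x"), (1L, "a"), (2L, "b")))
val right = sc.parallelize(Seq((0L, "y"), (1L, "c")))

// 1. Look for skew: one key (0, a null sentinel, etc.) with a huge count is the red flag.
val hotKeys = left.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)
hotKeys.top(10)(Ordering.by[(Long, Long), Long](_._2)).foreach(println)

// 2. Drop (or route separately) the catch-all key before the join.
val joined = left.filter(_._1 != 0L).join(right.filter(_._1 != 0L))
println(joined.collect().mkString(", "))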
I need to have my own scheduler to point to a proprietary remote execution
framework.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2152
I'm looking at where it decides on the backend and it doesn't look like
there is a hook. Of course I can
I'm surprised that I haven't been able to find this via google, but I
haven't...
What is the setting that requests some amount of disk space for the
executors? Maybe I'm misunderstanding how this is configured...
Thanks for any help!
n behind this.
>
> Thanks.
>
> Zhan Zhang
>
>
>
>
> On Mar 26, 2015, at 2:49 PM, Jonathan Coveney wrote:
I believe if you do the following:
sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString
(8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
 |  MapPartitionsRDD[33] at mapValues at <console>:23 []
 |  ShuffledRDD[32] at reduceByKey at <console>:
Hello all,
I am wondering if spark already has support for optimizations on sorted
data and/or if such support could be added (I am comfortable dropping to a
lower level if necessary to implement this, but I'm not sure if it is
possible at all).
Context: we have a number of data sets which are es