>>> in https://issues.apache.org/jira/browse/SPARK-3655 (not yet available) and
>>> https://github.com/tresata/spark-sorted (not part of spark, but it is
>>> available right now). Hopefully that's what you are looking for. To the
>>> best of my knowledge that
>>>
> ...transformations to compute the required results in one single operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney wrote:
Threads
On Friday, January 15, 2016, Kira wrote:
> Hi,
>
> Can we run *simultaneous* actions on the *same RDD*? If yes, how can this
> be done?
>
> Thank you,
> Regards
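In practice "threads" means launching each action from its own thread (for example with Scala Futures) against a cached RDD; the scheduler happily runs the resulting jobs concurrently. A minimal sketch, assuming a local SparkContext purely for illustration:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

import org.apache.spark.{SparkConf, SparkContext}

object SimultaneousActions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("simultaneous-actions").setMaster("local[4]"))

    // Cache so both jobs reuse the same data instead of recomputing it.
    val rdd = sc.parallelize(1 to 1000000).cache()

    // Each action is submitted from its own thread; Spark runs the jobs concurrently.
    val total = Future { rdd.sum() }
    val evens = Future { rdd.filter(_ % 2 == 0).count() }

    println(Await.result(total, 10.minutes))
    println(Await.result(evens, 10.minutes))
    sc.stop()
  }
}

Spark also ships async variants of some actions (countAsync, collectAsync, etc. on AsyncRDDActions) that return a FutureAction, which amounts to the same thing.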
reading the code, is there any reason why
setting spark.cleaner.ttl.MAP_OUTPUT_TRACKER directly won't get picked up?
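For reference, a sketch of what I mean by setting it directly, alongside the usual spark.cleaner.ttl (whether the per-cleaner key is honored is exactly the question):

import org.apache.spark.{SparkConf, SparkContext}

// Global metadata TTL in seconds, plus the tracker-specific override under discussion.
val conf = new SparkConf()
  .setAppName("ttl-example")
  .set("spark.cleaner.ttl", "3600")
  .set("spark.cleaner.ttl.MAP_OUTPUT_TRACKER", "600")

val sc = new SparkContext(conf)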
2015-11-17 14:45 GMT-05:00 Jonathan Coveney :
so I have the following...
broadcast some stuff
cache an rdd
do a bunch of stuff, eventually calling actions which reduce it to an
acceptable size
I'm getting an OOM on the driver (well, GC is getting out of control),
largely because I have a lot of partitions and it looks like the job
history is
unclear as to why it works with 2.11.7 and not 2.11.6.
>
> Thanks,
> Babar
>
> On Mon, Nov 2, 2015 at 2:10 PM Jonathan Coveney wrote:
Caused by: java.lang.ClassNotFoundException: scala.Some

indicates that you don't have the Scala libs present. How are you executing
this? My guess is that the issue is a conflict between Scala 2.11.6 in your
build and 2.11.7? Not sure... try setting your Scala version to 2.11.7.
But really, first it'd be good
Additionally, I'm curious whether there are any JIRAs around making DataFrames
support ordering better. There are a lot of operations that can be optimized
if you know that you have a total ordering on your data... are there any
plans, or at least JIRAs, around having the Catalyst optimizer handle this
case?
Do you have JAVA_HOME set to a Java 7 JDK?
2015-10-23 7:12 GMT-04:00 emlyn :
> xjlin0 wrote
> > I cannot enter the REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1 (whether pre-built
> > with or without Hadoop, or home compiled with ant or maven). There was no
> > error message in v1.4.x; the system prompts nothing.
I've noticed this as well and am curious if there is anything more people
can say.
My theory is that it is just communication overhead. If you only have a
couple of gigabytes (a tiny dataset), then splitting that across 50 nodes
means you'll have a ton of tiny partitions all finishing very quickly, and
the scheduling and communication overhead starts to dominate the actual work.
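If that is what is going on, one cheap experiment is to collapse the data into far fewer partitions so each task does meaningful work. A sketch in the shell (the input path and partition count are made up):

val lines = sc.textFile("hdfs:///data/tiny-dataset")   // hypothetical path

// coalesce() reduces the partition count without a full shuffle, so a small
// dataset isn't scattered into hundreds of near-empty tasks.
val compact = lines.coalesce(8)
println(compact.count())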
LZO files are not splittable by default, but there are projects with input
and output formats to make splittable LZO files. Check out Twitter's
elephant-bird on GitHub.
On Wednesday, October 7, 2015, Mohammed Guller wrote:
> It is not uncommon to process datasets larger than available memory
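As a rough sketch of what reading those looks like (the LzoTextInputFormat class name comes from elephant-bird; the path is hypothetical, and you still need the hadoop-lzo codec plus per-file .index files for the splits to actually kick in):

import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Read LZO-compressed text through elephant-bird's splittable input format.
val lines = sc.newAPIHadoopFile(
    "hdfs:///data/logs",                 // hypothetical path
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, text) => text.toString }

println(lines.count())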
Nobody is saying not to use immutable data structures, only that Guava's
aren't natively supported.
Scala's default collections library is immutable: List, Vector, Map.
This is what people generally use, especially in Scala code!
On Tuesday, October 6, 2015, Jakub Dubovsky <
spark.dubovsk
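To illustrate the point (a tiny sketch, nothing Spark-specific about it): the default collections are persistent, so "modifying" one just gives you a new value, which also makes them safe to capture in Spark closures.

// Scala's default List and Map are immutable; "adding" returns a new collection.
val xs = List(1, 2, 3)
val ys = 0 :: xs                  // new list; xs is untouched
val m  = Map("a" -> 1)
val m2 = m + ("b" -> 2)           // new map; m is untouched

// They can be captured directly in RDD transformations.
val rdd = sc.parallelize(ys).map(n => n + m2.getOrElse("b", 0))
println(rdd.collect().mkString(","))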
You can put a class in the org.apache.spark namespace to access anything
that is private[spark]. You can then make enrichments there to access
whatever you need. Just beware upgrade pain :)
On Tuesday, October 6, 2015, Erwan ALLAIN wrote:
> Hello,
>
> I'm currently testing spark streaming
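A minimal sketch of the namespace trick; the member reached here (SparkContext.env, which is private[spark] in the versions I've looked at) is just an example, so substitute whatever internal you actually need:

package org.apache.spark

// Anything declared under org.apache.spark can see members marked private[spark].
// This compiles against internals, so expect breakage on upgrades.
object SparkInternals {
  // Example enrichment: peek at the (private[spark]) SparkEnv behind a context.
  def executorMemorySetting(sc: SparkContext): Option[String] =
    sc.env.conf.getOption("spark.executor.memory")
}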
It's entirely conceivable to beat Spark's performance on tiny data sets like
this. That's not really what it has been optimized for.
On Tuesday, September 22, 2015, juljoin wrote:
> Hello,
>
> I am trying to figure Spark out and I still have some problems with its
> speed, I ca
Having a file per record is pretty inefficient on almost any file system.
On Tuesday, September 22, 2015, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> We are trying to load around 10k avro files (each file holds only one
> record) using spark-avro but it takes over 15 min
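One common mitigation is to read the whole directory in a single pass and immediately collapse it into a sane number of partitions. A sketch using the spark-avro reader of that era (the com.databricks.spark.avro format name and the path are assumptions; check against your version):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Read the directory of tiny Avro files in one go, then repartition so
// downstream stages aren't dominated by per-file scheduling overhead.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("hdfs:///data/events")           // hypothetical path
  .repartition(16)

println(df.count())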
It worked for Twitter!
Seriously though: Scala is much, much more pleasant. And Scala has a great
story for using Java libs. And since Spark is kind of framework-y (use its
scripts to submit, start up the REPL, etc.), the projects tend to be leaf
projects, so even in a big company that uses Java the cost
> scalaVersion := "2.11.6"
>
> libraryDependencies ++= Seq(
>   "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
>   "org.apache.spark" %% "spark-sql" % "1.4.1" % "provided",
>   ...
Try adding the following to your build.sbt
libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.6"
I believe that spark shades the scala library, and this is a library
that it looks like you need in an unshaded way.
2015-09-07 16:48 GMT-04:00 Gheorghe Postelnicu <
gheorghe.posteln
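Putting that together, the build.sbt would look roughly like this (a sketch assuming the Spark 1.4.1 / Scala 2.11.6 combination from the thread):

scalaVersion := "2.11.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"    % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"     % "1.4.1" % "provided",
  // scala-reflect appears to be needed unshaded at runtime, hence the explicit dependency.
  "org.scala-lang"   %  "scala-reflect" % "2.11.6"
)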
You can make a Hadoop input format which passes through the name of the
file. I generally find it easier to just hit the Hadoop FileSystem API, get
the file names, and construct the RDDs, though.
On Tuesday, September 1, 2015, Matt K wrote:
> Just want to add - I'm looking to partition the resulting Parquet
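A sketch of the second approach: list the files from HDFS, build one RDD per file tagged with its name, and union them (paths are hypothetical; sc.wholeTextFiles is another option when the files are small):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List the input files directly from the file system.
val fs = FileSystem.get(new Configuration())
val files = fs.listStatus(new Path("hdfs:///data/input"))   // hypothetical dir
  .filter(_.isFile)
  .map(_.getPath.toString)

// One RDD per file, each record tagged with the file it came from.
val perFile = files.map(f => sc.textFile(f).map(line => (f, line)))
val all = sc.union(perFile.toSeq)
println(all.count())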
Array[String] doesn't pretty print by default. Use .mkString(",") on each
array, for example.
On Thursday, August 27, 2015, Arun Luthra wrote:
> What types of RDD can saveAsObjectFile(path) handle? I tried a naive test
> with an RDD[Array[String]], but when I tried to read back the result with
> sc.objectFile
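A tiny sketch of the round trip and of why the printed output looks wrong (the path is made up):

val rdd = sc.parallelize(Seq(Array("a", "b"), Array("c", "d")))

// saveAsObjectFile works for any RDD whose elements are Java-serializable.
rdd.saveAsObjectFile("hdfs:///tmp/arrays")                    // hypothetical path

val back = sc.objectFile[Array[String]]("hdfs:///tmp/arrays")

// Arrays print as something like [Ljava.lang.String;@1b6d3586 --
// use mkString to see the contents.
back.collect().foreach(arr => println(arr.mkString(",")))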
I've used the instructions and it worked fine.
Can you post exactly what you're doing, and what it fails with? Or are you
just trying to understand how it works?
2015-08-24 15:48 GMT-04:00 Lanny Ripple :
> Hello,
>
> The instructions for building spark against scala-2.11 indicate using
> -Dspark
Put a log4j.properties file in conf/. You can copy
log4j.properties.template as a good base.
On Wednesday, July 29, 2015, canan chen wrote:
> Anyone know how to set log level in spark-submit ? Thanks
>
That's great! Thanks
On Tuesday, July 28, 2015, Ted Yu wrote:
> If I understand correctly, there would be one value in the executor.
>
> Cheers
>
> On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney wrote:
I am running in coarse-grained mode, let's say with 8 cores per executor.
If I use a broadcast variable, will all of the tasks in that executor share
the same value? Or will each task broadcast its own value? I.e., in this
case, would there be one value in the executor shared by the 8 tasks, or
would there be 8 copies?
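For reference, a minimal sketch of the pattern; the broadcast value is deserialized once per executor JVM and shared read-only by every task running there:

// A lookup table we want on every executor without shipping it with each task.
val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
val bc = sc.broadcast(lookup)

val data = sc.parallelize(Seq("a", "b", "c", "d"))

// Tasks read bc.value; within one executor they all see the same deserialized object.
val scored = data.map(k => (k, bc.value.getOrElse(k, 0)))
println(scored.collect().mkString(", "))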
Spark version is 1.3.0 (will upgrade as soon as we upgrade past mesos
0.19.0)...
Regardless, I'm running into a really weird situation where when I pass
--jars to bin/spark-shell I can't reference those classes on the repl. Is
this expected? The logs even tell me that my jars have been added, and
What about .groupBy doesn't work for you?
2015-05-07 8:17 GMT-04:00 Night Wolf :
> MyClass is a basic scala case class (using Spark 1.3.1);
>
> case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int, ipi: Double) {
>   override def hashCode(): Int = crn.hashCode()
> }
>
>
> On Wed
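For concreteness, a sketch of grouping such records (the Result class as quoted above; the sample values are made up):

case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int, ipi: Double) {
  override def hashCode(): Int = crn.hashCode()
}

val results = sc.parallelize(Seq(
  Result(1L, 10, 1, 100, 0.5),
  Result(1L, 11, 2, 101, 0.7),
  Result(2L, 12, 1, 102, 0.9)))

// Group all results that share the same crn.
val byCrn = results.groupBy(_.crn)
byCrn.collect().foreach { case (crn, rs) => println(s"$crn -> ${rs.size} rows") }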
A helpful example of how to convert:
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
As far as performance, that depends on your data. If you have a lot of
columns and use all of them, Parquet deserialization is expensive. If you
have a lot of columns and only need a few, the columnar layout lets you read
just those and skip the rest.
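A quick sketch of that column-pruning upside with the DataFrame reader (Spark 1.4+ syntax; the path and column names are made up):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.parquet("hdfs:///warehouse/events")   // hypothetical path

// Only the selected columns are read from the Parquet row groups;
// the remaining columns are skipped entirely.
val slim = df.select("user_id", "event_time")
println(slim.count())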
Can you check your local and remote logs?
2015-05-06 16:24 GMT-04:00 Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com>:
> This problem happens in Spark 1.3.1. It happens when two jobs are running
> simultaneously, each in its own SparkContext.
>
>
>
> I don’t remember seeing this bug in Spar
> Any suggestion from your experience on how to organize data in a splittable
> file format on HDFS for Spark?
>
> Rendy
> On May 6, 2015 1:03 AM, "Jonathan Coveney" wrote:
"As per my understanding, storing 5minutes file means we could not create
RDD more granular than 5minutes."
This depends on the file format. Many file formats are splittable (like
parquet), meaning that you can seek into various points of the file.
2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior
not fit. I'd be interested to hear more about a workload where that's
> relevant though, before going that route. Maybe if people are using
> SSD's that would make sense.
>
> - Patrick
>
> On Mon, Apr 13, 2015 at 8:19 AM, Jonathan Coveney wrote:
> > I'
in either RDD that has itemID = 0 or null?
> And what is a catch-all?
>
> Does that imply it is a good idea to run a filter on each RDD first? We do
> not do this using Pig on M/R. Is it required in the Spark world?
>
> On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney wrote:
My guess would be data skew. Do you know if there is some item id that is a
catch-all? Can it be null? Item id 0? Lots of data sets have this sort of
value, and it always kills joins.
2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :
> Code:
>
> val viEventsWithListings: RDD[(Long, (DetailInputRecord, VIS
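A sketch of the usual first check and mitigation: count the hottest keys, then filter out (or handle separately) the catch-all values before joining. The key values here are made up.

// Hypothetical keyed RDDs about to be joined on item id.
val left  = sc.parallelize(Seq((0L, "x"), (1L, "a"), (2L, "b")))
val right = sc.parallelize(Seq((0L, "y"), (1L, "c")))

// 1. Look for skew: one key (0, a null sentinel, etc.) with a huge count is the red flag.
val hotKeys = left.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)
hotKeys.top(10)(Ordering.by[(Long, Long), Long](_._2)).foreach(println)

// 2. Drop (or route separately) the catch-all key before the join.
val joined = left.filter(_._1 != 0L).join(right.filter(_._1 != 0L))
println(joined.collect().mkString(", "))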
I need to have my own scheduler to point to a proprietary remote execution
framework.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2152
I'm looking at where it decides on the backend and it doesn't look like
there is a hook. Of course I can
I'm surprised that I haven't been able to find this via google, but I
haven't...
What is the setting that requests some amount of disk space for the
executors? Maybe I'm misunderstanding how this is configured...
Thanks for any help!
n behind this.
>
> Thanks.
>
> Zhan Zhang
>
>
>
>
> On Mar 26, 2015, at 2:49 PM, Jonathan Coveney wrote:
I believe if you do the following:
sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString
(8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
 |  MapPartitionsRDD[33] at mapValues at <console>:23 []
 |  ShuffledRDD[32] at reduceByKey at <console>:
Hello all,
I am wondering if spark already has support for optimizations on sorted
data and/or if such support could be added (I am comfortable dropping to a
lower level if necessary to implement this, but I'm not sure if it is
possible at all).
Context: we have a number of data sets which are es