RDDs which are no longer required will be removed from memory by Spark
itself (which you can consider lazy?).
Thanks
Best Regards
On Wed, Jul 1, 2015 at 7:48 PM, Jem Tucker jem.tuc...@gmail.com wrote:
Hi,
The current behavior of rdd.unpersist() appears to not be lazily executed
and
Please see the below link for the available options:
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#performance-tuning
For example, reducing spark.sql.shuffle.partitions from 200 to 10 could
improve performance significantly.
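A minimal sketch of applying that setting (the value 10 is just the example above):
sqlContext.setConf("spark.sql.shuffle.partitions", "10")  // or --conf spark.sql.shuffle.partitions=10 at submit time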
Team,
Got this fixed. After a lot of juggling, when I replaced JRE7 with JRE8,
things started working as intended.
Cheers!
On Wed, Jul 1, 2015 at 7:07 PM, Ramprakash Ramamoorthy
youngestachie...@gmail.com wrote:
Team,
I'm just playing around with Spark and MLlib. Installed Scala and
Have a look at sc.wholeTextFiles; you can use it to read the whole CSV
contents into the value, then split it on \n, collect the lines into a list
and return it.
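A minimal sketch of that approach (the input path is an assumption):
val linesPerFile = sc.wholeTextFiles("hdfs:///data/csv")
  .map { case (path, content) => content.split("\n").toList }  // one list of lines per file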
*sc.wholeTextFiles:*
Read a directory of text files from HDFS, a local file system (available on
all nodes), or any Hadoop-supported
You can keep a joined dataset cached and filter that joined df with your
filter condition
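A hedged sketch of that suggestion; df, otherDf and the join key "id" are assumptions, and someVal is the dynamic value from the question below:
val joined = df.join(otherDf, "id").cache()              // build and cache the join once
val filtered = joined.filter(joined("id") === someVal)   // apply the dynamic filter condition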
On 2 Jul 2015 15:01, Mailing List asoni.le...@gmail.com wrote:
I need to pass the value of the filter dynamically, like where id=someVal,
and that someVal exists in another RDD.
How can I do this across
Hi,
After running some tests it appears the unpersist is called as soon as it
is reached, so any tasks using this RDD later on will have to recalculate
it. This is fine for simple programs, but when an RDD is created within a
function and its reference is then lost but children of it continue to
Changing the JDK from 1.8.0_45 to 1.7.0_79 solved this issue.
I saw https://issues.apache.org/jira/browse/SPARK-6388,
but it is not the problem here.
On Thu, Jul 2, 2015 at 1:30 PM, xiaohe lan zombiexco...@gmail.com wrote:
Hi Expert,
Hadoop version: 2.4
Spark version: 1.3.1
I am running the
Hi Michael,
Thanks for the quick response. This sounds like something that would work.
However, rethinking the problem statement and various other use cases,
which are growing, there are more such scenarios where one could have
columns with structured and unstructured data embedded (JSON or XML
Indeed Spark does not have .NET bindings.
On Thu, Jul 2, 2015 at 10:33 AM, Zwits daniel.van...@ortec-finance.com
wrote:
I'm currently looking into a way to run a program/code (DAG) written in
.NET
on a cluster using Spark. However, I ran into problems concerning the coding
language; Spark has
Found this to be a bug in Spark 1.4.0: SPARK-8368
https://issues.apache.org/jira/browse/SPARK-8368
Thanks!
Terry
On Thu, Jul 2, 2015 at 1:20 PM, Terry Hole hujie.ea...@gmail.com wrote:
All,
I am using the Spark console 1.4.0 to do some tests; when I create a new
HiveContext (Line 18 in the code) in
My spark-sql command:
spark-sql --driver-memory 2g --master spark://hadoop04.xx.xx.com:8241 --conf
spark.driver.cores=20 --conf spark.cores.max=20 --conf
spark.executor.memory=2g --conf spark.driver.memory=2g --conf
spark.akka.frameSize=500 --conf spark.eventLog.enabled=true --conf
Hi Richard,
I have actually applied the following fix to our 1.4.0 version and this
seems to resolve the zombies :)
https://github.com/apache/spark/pull/7077/files
Sjoerd
2015-06-26 20:08 GMT+02:00 Richard Marscher rmarsc...@localytics.com:
Hi,
we are on 1.3.1 right now so in case there are
Hi there: I got a problem that the application has been killed. Reason: All
masters are unresponsive! Giving up. I checked the network I/O and found
it is sometimes really high when running my app. Please refer to the attached pic
for more info. I also checked
Since Spark runs on the JVM, no, there isn't support for .NET.
You should take a look at Dryad and Naiad instead.
https://github.com/MicrosoftResearch/
From: Zwits daniel.van...@ortec-finance.com
Sent: 7/2/2015 4:33 AM
To:
I'm currently looking into a way to run a program/code (DAG) written in .NET
on a cluster using Spark. However, I ran into problems concerning the coding
language; Spark has no .NET API.
I tried looking into IronPython because Spark does have a Python API, but I
couldn't find a way to use this.
Is
Hi,
I'm using Spark 1.4. I have an array field in my data frame, and when I'm
trying to write this dataframe to postgres, I'm getting the following
exception:
Exception in thread main java.lang.IllegalArgumentException: Can't
translate null value for field
You may pass an optional parameter (blocking = false) to make it lazy.
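A minimal sketch (the RDD name is an assumption):
myRdd.unpersist(blocking = false)   // returns immediately; the cached blocks are removed asynchronously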
Thank you,
Ilya Ganelin
-Original Message-
From: Jem Tucker [jem.tuc...@gmail.com]
Sent: Thursday, July 02, 2015 04:06 AM Eastern Standard Time
To: Akhil Das
Cc: user
Subject: Re: Making
Hi All,
I have a stream of events coming in and I want to fetch some additional
data from the database based on the values in the incoming data. For example,
below is the data coming in:
loginName
Email
address
city
Now for each login name I need to go to the Oracle database and get the userId
from
Hi,
What are the options in the DataFrame/JdbcRDD save/saveAsTable APIs?
Is there any option to override/update a particular column in the table
instead of overwriting the whole table, based on some ID column?
SaveMode.Append is there, but it won't help us update the record; it will
append/add a new row to
Hi,
I've downloaded the pre-built binaries for Hadoop 2.6, and whenever I start
the spark-shell it always starts with a HiveContext.
How can I disable the HiveContext from being initialized automatically?
Thanks,
Daniel
You should be able to do something like this (assuming an input file formatted
as: String, IntVal, LongVal)
import org.apache.spark.sql.types._
val recSchema = StructType(List(StructField("strVal", StringType, false),
  StructField("intVal", IntegerType, false),
  StructField("longVal", LongType, false)))
Good to know this will be in the next release. Thanks.
On Wed, Jul 1, 2015 at 3:13 PM, Michael Armbrust mich...@databricks.com
wrote:
We don't know that the table is small unless you cache it. In Spark 1.5
you'll be able to give us a hint though (
In case it helps: I got around it temporarily by saving and resetting the
context class loader around creating the HiveContext.
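A minimal sketch of that workaround, assuming it simply wraps the HiveContext construction:
val savedLoader = Thread.currentThread().getContextClassLoader
try {
  val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
  // ... use hiveCtx ...
} finally {
  Thread.currentThread().setContextClassLoader(savedLoader)   // restore the saved loader
}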
On Jul 2, 2015 4:36 AM, Terry Hole hujie.ea...@gmail.com wrote:
Found this to be a bug in Spark 1.4.0: SPARK-8368
https://issues.apache.org/jira/browse/SPARK-8368
Thanks!
Hi,
I'm trying to start the thrift-server, passing it Azure's blob storage
jars, but I'm failing on:
Caused by: java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at
Which Spark release are you using ?
bq. yarn--jars
I guess the above was just a typo in your email (missing space).
Cheers
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it azure's blob storage
jars
Ah I see, glad that simple patch works for your problem. That seems to be a
different underlying problem than we have been experiencing. In our case,
the executors are failing properly; it's just that none of the new ones will
ever escape experiencing the same exact issue. So we start a death
Thanks
On Thu, Jul 2, 2015 at 5:40 PM, Kohler, Curt E (ELS-STL)
c.koh...@elsevier.com wrote:
You should be able to do something like this (assuming an input file
formatted as: String, IntVal, LongVal)
import org.apache.spark.sql.types._
val recSchema =
I am sorting a data frame using something like:
val sortedDF = df.orderBy(df("score").desc)
The sorting is really fast. The issue I have is that after sorting, the
resulting data frame sortedDF appears to be in a single partition, which is
a problem because when I try to execute another operation
Once again I am trying to read a directory tree using binary files.
My directory tree has a root dir ROOTDIR and subdirs where the files are
located, i.e.
ROOTDIR/1
ROOTDIR/2
ROOTDIR/..
ROOTDIR/100
A total of 1 million files split into 100 subdirs.
Using binaryFiles requires too much memory on the
Hi Suraj,
It seems your requirement is Record Linkage/Entity Resolution.
https://en.wikipedia.org/wiki/Record_linkage
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
A presentation from Spark Summit using
Hi Sim,
Seems you already set the PermGen size to 256m, right? I notice that in
your shell, you created a HiveContext (it further increased the memory
consumption on PermGen). But the spark shell has already created a HiveContext
for you (sqlContext). You can use asInstanceOf to access
Same error with the new code:
import org.apache.spark.sql.hive.HiveContext
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
val df =
ctx.jsonFile("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-0.gz")
df.registerTempTable("training")
val dfCount =
Hi all!
My SQL case is:
insert overwrite table test1 select * from test;
At the end of the job I got a move-file error.
I see Hive 0.13.1's support for viewfs is not good until Hive 1.1.0+.
How can I upgrade the Hive version for Spark? Or how can I fix the bug in
org.spark-project.hive?
My version:
Spark version
Thanks Sandy et al, I will try that. I like that I can choose the
minRegisteredResourcesRatio.
On Wed, Jun 24, 2015 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:
Hi Arun,
You can achieve this by
setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really
high number
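A hedged sketch of those settings (the values are illustrative only):
val conf = new org.apache.spark.SparkConf()
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "3600s")
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")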
I am surprised this is allowed...
scala> sqlContext.sql("select name as boo, score as boo from candidates").schema
res7: org.apache.spark.sql.types.StructType =
StructType(StructField(boo,StringType,true),
StructField(boo,IntegerType,true))
should StructType check for duplicate field names?
What is the rationale for not allowing the same column in a GroupedData to be
aggregated more than once using agg, especially when the method signature
def agg(aggExpr: (String, String), aggExprs: (String, String)*) allows
passing something like agg("x" -> "sum", "x" -> "avg")?
Error - ImportError: No module named Row
Cheers enjoy the long weekend
k/
Hi there, I checked the source code and found that in
org.apache.spark.deploy.client.AppClient there is a parameter (line 52):
val REGISTRATION_TIMEOUT = 20.seconds
val REGISTRATION_RETRIES = 3
As I understand it, if I want to increase the retry count, must I modify this value and rebuild the
Hi All,
I'm facing a quite strange case: after migrating to Spark 1.4.0, I'm
seeing Spark MLlib produce different results in local mode and
cluster mode. Is there any possibility of that happening? (I feel this is
an issue in my environment, but just wanted to get it confirmed.)
Thanks.
Hi,
On Thu, Jan 29, 2015 at 9:52 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Thu, Jan 29, 2015 at 1:54 AM, YaoPau jonrgr...@gmail.com wrote:
My thinking is to maintain state in an RDD and update and persist it with
each 2-second pass, but this also seems like it could get messy.
Hi Spark devs,
I'm coding a spark job and at a certain point in execution I need to send
some data present in an RDD to an external system.
val myRdd =
myRdd.foreach { record =>
sendToWhtv(record)
}
The thing is that foreach forces materialization of the RDD and it seems to
be executed
*The thing is that foreach forces materialization of the RDD and it seems
to be executed on the driver program*
What makes you think that? No, foreach is run in the executors
(distributed) and not in the driver.
2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues
alex.jose.rodrig...@gmail.com:
Hi
Hi,
I'm using 1.4.
It's indeed a typo in the email itself.
Thanks,
Daniel
On Thu, Jul 2, 2015 at 6:06 PM, Ted Yu yuzhih...@gmail.com wrote:
Which Spark release are you using ?
bq. yarn--jars
I guess the above was just a typo in your email (missing space).
Cheers
On Thu, Jul 2, 2015 at
foreach absolutely runs on the executors. For sending data to an external
system you should likely use foreachPartition in order to batch the output.
Also if you want to limit the parallelism of the output action then you can use
coalesce.
What makes you think foreach is running on the driver?
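A minimal sketch of the foreachPartition pattern described above; openConnection and sendBatch are hypothetical placeholders for your external system:
myRdd.coalesce(4).foreachPartition { records =>
  val conn = openConnection()                                    // one connection per partition
  records.grouped(100).foreach(batch => sendBatch(conn, batch))  // batch the output
  conn.close()
}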
If you are joining successive lines together based on a predicate, then you
are doing a flatMap, not an aggregate. You are on the right track with a
multi-pass solution. I had the same challenge when I needed a sliding
window over an RDD (see below).
[I had suggested that the sliding window API be
Thanks for the tip. Any idea why the intuitive answer ( != None) doesn't
work? I inspected the Row columns and they do indeed have a None value. I
would suspect that somehow Python's None is translated to something in the JVM
which doesn't compare equal to null?
I might check out the source code for a better
Foreach is listed as an action [1]. I guess an *action* just means that it
forces materialization of the RDD.
I just noticed much faster executions with map, although I don't like the
map approach. I'll look at it with new eyes if foreach is the way to go.
[1] –
This will not work, i.e. using a data frame inside a map function.
Although you can try to create the df separately and cache it...
Then you can join your event stream with this df.
On Jul 2, 2015 6:11 PM, Ashish Soni asoni.le...@gmail.com wrote:
Hi All ,
I have and Stream of Event coming in and i want
SPARK-7879 https://issues.apache.org/jira/browse/SPARK-7879 seems to
address your use case (running KMeans on a dataframe and having the results
added as an additional column)
On Wed, Jul 1, 2015 at 5:53 PM, Eric Friedman eric.d.fried...@gmail.com
wrote:
In preparing a DataFrame (spark 1.4) to
Hi, I have a cluster of 4 machines and I call
sc.wholeTextFiles("/x/*/*.txt")
folder x contains subfolders and each subfolder contains thousands of files,
with a total of ~1 million matching the path expression.
My Spark task starts processing the files but single-threaded. I can see
that in the Spark UI,
In SparkUI I can see it creating 2 stages. I tried
wholeTextFiles().repartition(32) but same threading results.
You can collect the dataframe as an array and then create a map out of it...,
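A hedged sketch of that idea; the DataFrame name and the loginName/userId columns are assumptions:
val lookup: Map[String, Long] = usersDf.collect()
  .map(r => r.getAs[String]("loginName") -> r.getAs[Long]("userId"))
  .toMap
val lookupBc = sc.broadcast(lookup)   // broadcast so executors can look up userIds cheaply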
On Jul 2, 2015 9:23 AM, asoni.le...@gmail.com wrote:
Any example how can i return a Hashmap from data frame ?
Thanks ,
Ashish
On Jul 1, 2015, at 11:34 PM, Holden Karau hol...@pigscanfly.ca wrote:
Collecting it as a
Heh, an action, or materialization, means that it will trigger the
computation over the RDD. A transformation like map means that it will
create the transformation chain that must be applied on the data, but it is
actually not executed. It is executed only when an action is triggered over
that
Hi
Sorry for this Scala/Spark newbie question. I am creating RDDs which
represent large time series this way:
val data = sc.textFile("somefile.csv")
case class Event(
time: Double,
x: Double,
vztot: Double
)
val events = data.filter(s => !s.startsWith("GMT")).map { s =>
What I'm doing in the RDD is parsing a text file and sending things to the
external system. I guess that it does that immediately when the action
(count) is triggered, instead of being a two-step process.
So I guess I should have the parsing logic + sending to the external system inside
the foreach (with
Hi,
You can set those parameters through
spark.executor.extraJavaOptions,
which is documented in the configuration guide:
spark.apache.org/docs/latest/configuration.html
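A hedged sketch; the specific flags are illustrative only (note that heap-size settings are not allowed here, as mentioned later in this thread):
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:MaxPermSize=256m")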
On 2 Jul 2015 9:06 pm, Mulugeta Mammo mulugeta.abe...@gmail.com wrote:
Hi,
I'm running Spark 1.4.0, I want to specify
A very simple Spark SQL COUNT operation succeeds in spark-shell for 1.3.1 and
fails with a series of out-of-memory errors in 1.4.0.
This gist https://gist.github.com/ssimeonov/a49b75dc086c3ac6f3c4
includes the code and the full output from the 1.3.1 and 1.4.0 runs,
including the command line
Yes, that does appear to be the case. The documentation is very clear
about the heap settings and that they can not be used with
spark.executor.extraJavaOptions
spark.executor.extraJavaOptions (default: none): A string of extra JVM options to pass
to executors. For instance, GC settings or other logging.
On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl a...@whisperstream.com wrote:
In pyspark, when I convert from RDDs to dataframes it looks like the RDD is
being materialized/collected/repartitioned before it's converted to a
dataframe.
That's not true. When converting an RDD to a dataframe, it only takes
Tried that one and it throws an error - extraJavaOptions is not allowed to
alter memory settings, use spark.executor.memory instead.
On Thu, Jul 2, 2015 at 12:21 PM, Benjamin Fradet benjamin.fra...@gmail.com
wrote:
Hi,
You can set those parameters through the
spark.executor.extraJavaOptions
Thanks, Mohit. It sounds like we're on the same page -- I used a similar
approach.
On Thu, Jul 2, 2015 at 12:27 PM, Mohit Jaggi mohitja...@gmail.com wrote:
if you are joining successive lines together based on a predicate, then
you are doing a flatMap not an aggregate. you are on the right
Hi Sim,
Spark 1.4.0's memory consumption on PermGen is higher than Spark 1.3's
(explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you
add --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m to the
command you used to launch the Spark shell? This will increase the PermGen size
You should use:
spark.executor.memory
from the docs https://spark.apache.org/docs/latest/configuration.html:
spark.executor.memory (default: 512m): Amount of memory to use per executor process, in
the same format as JVM memory strings (e.g. 512m, 2g).
-Todd
On Thu, Jul 2, 2015 at 3:36 PM, Mulugeta Mammo
Thanks, but my use case requires that I specify different start and max heap
sizes. It looks like Spark sets the start and max sizes to the same value.
On Thu, Jul 2, 2015 at 1:08 PM, Todd Nist tsind...@gmail.com wrote:
You should use:
spark.executor.memory
from the docs
Ya, I think it's a limitation too. I looked at the source code; in
SparkConf.scala and ExecutorRunnable.scala both Xms and Xmx are set to the same
value, which is spark.executor.memory.
Thanks
On Thu, Jul 2, 2015 at 1:18 PM, Todd Nist tsind...@gmail.com wrote:
Yes, that does appear to be the case. The
How about:
events.sliding(3).zipWithIndex.filter(_._2 % 3 == 0)
That would group the RDD into adjacent buckets of size 3.
On Thu, Jul 2, 2015 at 2:33 PM, tog guillaume.all...@gmail.com wrote:
Was complaining about the Seq ...
Moved it to
val eventsfiltered = events.sliding(3).map(s =
Hi,
It seems to me Spark launches a process to read spark-defaults.conf
and then launches another process to do the app stuff.
The code here should confirm it:
https://github.com/apache/spark/blob/master/bin/spark-class#L76
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
But
Was complaining about the Seq ...
Moved it to
val eventsfiltered = events.sliding(3).map(s => Event(s(0).time,
  (s(0).x + s(1).x + s(2).x)/3.0, (s(0).vztot + s(1).vztot + s(2).vztot)/3.0))
and that is working.
Anyway this is not what I wanted to do; my goal was more to implement
bucketing to shorten the
After clicking the github spark repo, it is clearly here:
https://github.com/apache/spark/tree/master/launcher/src/main/java/org/apache/spark/launcher
My IntelliJ project sidebar was fully expanded and I was lost in another folder.
Problem solved.
Hi,
I'd like to specify the total sum of cores / memory as command line
arguments with spark-submit. That is, I'd like to set
yarn.nodemanager.resource.memory-mb and the
yarn.nodemanager.resource.cpu-vcores parameters as described in this
blog
Well, it did reduce the length of my series of events. I will have to dig
into what it actually did ;-)
I would assume that it took one out of every 3 values, is that correct?
Would it be possible to control a bit more how the value assigned to the
bucket is computed, for example take the first element, the
Consider an example dataset [a, b, c, d, e, f]
After sliding(3), you get [(a,b,c), (b,c,d), (c, d, e), (d, e, f)]
After zipWithIndex: [((a,b,c), 0), ((b,c,d), 1), ((c, d, e), 2), ((d, e,
f), 3)]
After filter: [((a,b,c), 0), ((d, e, f), 3)], which is what I'm assuming
you want (non-overlapping
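A minimal sketch putting that together; the sliding transform comes from MLlib's RDDFunctions, and the bucket size 3 is taken from the example above:
import org.apache.spark.mllib.rdd.RDDFunctions._
val buckets = events.sliding(3).zipWithIndex
  .filter { case (_, idx) => idx % 3 == 0 }   // keep every third window => non-overlapping buckets
  .map { case (window, _) => window }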
Understood. Thanks for your great help
Cheers
Guillaume
On 2 July 2015 at 23:23, Feynman Liang fli...@databricks.com wrote:
Consider an example dataset [a, b, c, d, e, f]
After sliding(3), you get [(a,b,c), (b,c,d), (c, d, e), (d, e, f)]
After zipWithIndex: [((a,b,c), 0), ((b,c,d), 1),