Okay that was some caching issue. Now there is a shared mount point between
the place the pyspark code is executed and the spark nodes it runs on. Hrmph,
I was hoping that wouldn't be the case. Fair enough!
On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote:
> Okay interesting, maybe my assumpt
/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0
so what is /data/hive even referring to? When I print out the spark conf
values, neither now refers to /data/hive/
On Thu, Mar 7, 2024 at 9:49 PM Tom Barber wrote:
> Wonder if anyone can just sort my brain out h
Wonder if anyone can just sort my brain out here as to what's possible or
not.
I have a container running Spark, with Hive and a ThriftServer. I want to
run code against it remotely.
If I take something simple like this
from pyspark.sql import SparkSession
from pyspark.sql.types import
zes for your?
Tom
On Thursday, November 3, 2022 at 03:18:07 PM CDT, Shay Elbaz
wrote:
This is exactly what we ended
up doing! The only drawback I saw with this approach is that the GPU tasks get
pretty big (in terms of data and compute tim
, before that
mapPartitions you could do a repartition if necessary to get to exactly the
number of tasks you want (20). That way even if maxExecutors=500 you will only
ever need 20 or whatever you repartition to and Spark isn't going to ask for
more than that.
Tom
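The repartition advice above can be pictured without Spark at all. This plain-Python sketch (illustrative, not the Spark API) shows why the partition count, rather than maxExecutors, bounds the number of tasks: Spark launches one task per partition.

```python
# Plain-Python analogue of repartitioning before an expensive stage.
# With 20 partitions there are at most 20 tasks' worth of work,
# no matter how many executors the cluster could allocate.
def repartition(records, num_partitions):
    """Round-robin records into num_partitions buckets, like rdd.repartition()."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

records = list(range(100))
parts = repartition(records, 20)
print(len(parts))                  # 20 buckets -> at most 20 concurrent tasks
print(sum(len(p) for p in parts))  # 100: every record is still present
```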
On Thursday, November 3
task would use the GPU
and the other could just use the CPU. Perhaps that is too simplistic or brittle
though.
Tom
On Saturday, July 31, 2021, 03:56:18 AM CDT, Andreas Kunft
wrote:
I have a setup with two work intensive tasks, one map using GPU followed by a
map using only CPU.
Using s
Looks like repartitioning was my friend, seems to be distributed across the
cluster now.
All good. Thanks!
On Wed, Jun 23, 2021 at 2:18 PM Tom Barber wrote:
> Okay so I tried another idea which was to use a real simple class to drive
> a mapPartitions... because logic in my head
b) how it divides up partitions to tasks
c) the fact it's a POJO and not a file of stuff.
Or probably some of all 3.
Tom
On Wed, Jun 23, 2021 at 11:44 AM Tom Barber wrote:
> (I should point out that I'm diagnosing this by looking at the active
> tasks https://pasteboard.co/K7VryDJ.png, if
(I should point out that I'm diagnosing this by looking at the active tasks
https://pasteboard.co/K7VryDJ.png, if I'm reading it incorrectly, let me
know)
On Wed, Jun 23, 2021 at 11:38 AM Tom Barber wrote:
> Uff hello fine people.
>
> So the cause of the above issue was, unsur
how to split that flatmap
operation up so the RDD processing runs across the nodes, not limited to a
single node?
Thanks for all your help so far,
Tom
On Wed, Jun 9, 2021 at 8:08 PM Tom Barber wrote:
> Ah no sorry, so in the load image, the crawl has just kicked off on the
> driver node which
.
Tom
On Wed, Jun 9, 2021 at 8:03 PM Sean Owen wrote:
> Where do you see that ... I see 3 executors busy at first. If that's the
> crawl then ?
>
> On Wed, Jun 9, 2021 at 1:59 PM Tom Barber wrote:
>
>> Yeah :)
>>
>> But it's all running through the same
rst place?
>
> On Wed, Jun 9, 2021 at 1:49 PM Tom Barber wrote:
>
>> Yeah but that something else is the crawl being run, which is triggered
>> from inside the RDDs, because the log output is slowly outputting crawl
>> data.
>>
>>
g else on the driver - not doing everything on 1 machine.
>
> On Wed, Jun 9, 2021 at 12:43 PM Tom Barber wrote:
>
>> And also as this morning: https://pasteboard.co/K5Q9aEf.png
>>
>> Removing the cpu pins gives me more tasks but as you can see here:
>>
>> https://pas
> On Wed, 9 Jun 2021 at 18:43, Tom Barber wrote:
>
>> And also as this morning: https://pasteboard.co/K5Q9aEf.png
>>
>> Removing the
And also as this morning: https://pasteboard.co/K5Q9aEf.png
Removing the cpu pins gives me more tasks but as you can see here:
https://pasteboard.co/K5Q9GO0.png
It just loads up a single server.
On Wed, Jun 9, 2021 at 6:32 PM Tom Barber wrote:
> Thanks Chris
>
> All the co
se checks out.
I'll poke around in the other hints you suggested later, thanks for the
help.
Tom
On Wed, Jun 9, 2021 at 5:49 PM Chris Martin wrote:
> Hmm then my guesses are (in order of decreasing probability):
>
> * Whatever class makes up fetchedRdd (MemexDeepCrawlDbRDD?) isn't
> compati
.getGroup, r))
>
> how many distinct groups did you end up with? If there's just one then I
> think you might see the behaviour you observe.
>
> Chris
>
>
> On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote:
>
>> Also just to follow up on that slightly, I di
ent] = repRdd.map(d =>
ScoreUpdateSolrTransformer(d))
I did that, but the crawl is executed in that repartition executor (which I
should have pointed out I already know).
Tom
On Wed, Jun 9, 2021 at 4:37 PM Tom Barber wrote:
> Sorry Sam, I missed that earlier, I'll give it a spin.
>
>
ache()
> repRdd.take(1)
> Then map operation on repRdd here.
>
> I’ve done similar map operations in the past and this works.
>
> Thanks.
>
> On Wed, Jun 9, 2021 at 11:17 AM Tom Barber wrote:
>
>> Also just to follow up on that slightly, I did also try off the back
RDD[SolrInputDocument] =
scoredRdd.repartition(50).map(d => ScoreUpdateSolrTransformer(d))
Where I repartitioned that scoredRdd map out of interest, it then triggers
the FairFetcher function there, instead of in the runJob(), but still on a
single executor
Tom
On Wed, Jun 9, 2021 at 4:11 PM Tom Barber
teRdd, scoreUpdateFunc)
When it's doing stuff in the SparkUI I can see that it's waiting on the
sc.runJob() line, so that's the execution point.
Tom
On Wed, Jun 9, 2021 at 3:59 PM Sean Owen wrote:
> persist() doesn't even persist by itself - just sets it to be persisted
> when it's execute
the tasks. Is that not the
case?
Thanks
Tom
On Wed, Jun 9, 2021 at 3:44 PM Mich Talebzadeh
wrote:
> Hi Tom,
>
> Persist() here simply means persist to memory. That is all. You can check
> UI tab on storage
>
>
> https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persi
> I think we need more info about what else is happening in the code.
>
> On Wed, Jun 9, 2021 at 6:30 AM Tom Barber wrote:
>
>> Yeah so if I update the FairFetcher to return a seq it makes no real
>> difference.
>>
>> Here's an image of what I'm seeing just for r
bfs:/FileStore/bcf/sparkler7.jar","crawl","-id","mytestcrawl11",
"-tn", "5000", "-co",
"{\"plugins.active\":[\"urlfilter-regex\",\"urlfilter-samehost\",\"fetcher-chrome\"],\"plugins\&
I've not run it yet, but I've stuck a toSeq on the end, but in reality a
Seq just inherits Iterator, right?
Flatmap does return a RDD[CrawlData] unless my IDE is lying to me.
Tom
On Wed, Jun 9, 2021 at 10:54 AM Tom Barber wrote:
> Interesting Jayesh, thanks, I will test.
>
> All
Interesting Jayesh, thanks, I will test.
All this code is inherited and it runs, but I don't think it's been tested
in a distributed context for about 5 years, but yeah I need to get this
pushed down, so I'm happy to try anything! :)
Tom
On Wed, Jun 9, 2021 at 3:37 AM Lalwani, Jayesh wrote
For anyone interested here's the execution logs up until the point where it
actually kicks off the workload in question:
https://gist.github.com/buggtb/a9e0445f24182bc8eedfe26c0f07a473
On 2021/06/09 01:52:39, Tom Barber wrote:
> ExecutorID says driver, and looking at the IP addresses
> how many partitions does the groupByKey produce? that would limit your
> parallelism no matter what if it's a small number.
>
> On Tue, Jun 8, 2021 at 8:07 PM Tom Barber wrote:
>
> > Hi folks,
> >
> > Hopefully someone with more Spark experience than me can ex
ecause the processing of the
data in the RDD isn't the bottleneck, the fetching of the crawl data is the
bottleneck, but that happens after the code has been assigned to a node.
Thanks
Tom
it didn't run on the GPU is to enable the config:
spark.rapids.sql.explain=NOT_ON_GPU
It will print out logs to your console as to why different operators don't run
on the GPU.
Again, feel free to open an issue in the spark-rapids repo and we
can discuss more there.
Tom
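For reference, that setting can also be passed at submit time as a --conf flag. A sketch only; the class name and jar below are placeholders, and only the rapids explain flag comes from this thread:

```
spark-submit \
  --conf spark.rapids.sql.explain=NOT_ON_GPU \
  --class com.example.YourApp \
  your-app.jar
```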
On Friday
" etc.
On Tue, Sep 8, 2020 at 10:11 PM Tom Scott wrote:
> Hi Guys,
>
> I asked this in stack overflow here:
> https://stackoverflow.com/questions/63535720/why-would-preferredlocations-not-be-enforced-on-an-empty-s
ee things like:
scala> someRdd.map(i=>i + ":" +
java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker3
2:worker1
3:worker2
scala> someRdd.map(i=>i + ":" +
java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker2
2:worker3
3:worker1
Am I doing this wrong or is this expected behaviour?
Thanks
Tom
I don't know if it all works but some work was done to make cluster manager
pluggable, see SPARK-13904.
Tom
On Wednesday, November 6, 2019, 07:22:59 PM CST, Klaus Ma
wrote:
Any suggestions?
- Klaus
On Mon, Nov 4, 2019 at 5:04 PM Klaus Ma wrote:
Hi team,
AFAIK, we built k8s/yarn
We are happy to announce the availability of Spark 2.2.2!
Apache Spark 2.2.2 is a maintenance release, based on the branch-2.2
maintenance branch of Spark. We strongly recommend that all 2.2.x users upgrade
to this stable release. The release notes are available at
Thanks Jörn, sounds like there's nothing obvious I'm missing, which is
encouraging.
I've not used Redis, but it does seem that for most of my current and
likely future use-cases it would be the best fit (nice compromise of scale
and easy setup / access).
Thanks,
Tom
On Wed, Sep 14, 2016 at 10
the cluster.
I guess there's no solution that fits all, but interested in other people's
experience and whether I've missed anything obvious.
Thanks,
Tom
)
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
Cheers,
Tom Ellis
Consultant Developer
not have access?
Cheers,
Tom Ellis
Consultant Developer - Excelian
Data Lake | Financial Markets IT
LLOYDS BANK COMMERCIAL BANKING
E: tom.el...@lloydsbanking.com
Website: www.lloydsbankcommercial.co
the source of Client [1] and YarnSparkHadoopUtil [2] – you’ll see
how obtainTokenForHBase is being done.
It’s a bit confusing as to why it says you haven’t kinited even when you do
loginUserFromKeytab – I haven’t quite worked through the reason for that yet.
Cheers,
Tom Ellis
telli...@gmail.com
pache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions
>
> An RDD of T produces an RDD of T[].
>
> On Fri, May 13, 2016 at 12:10 PM, Tom Godden <tgod...@vub.ac.be> wrote:
>> I assumed the "fixed size blocks" mentioned in the documentation
>&g
re. The return type is an RDD of
> arrays, not of RDDs or of ArrayLists. There may be another catch but
> that is not it.
>
> On Fri, May 13, 2016 at 11:50 AM, Tom Godden <tgod...@vub.ac.be> wrote:
>> I believe it's an illegal cast. This is the line of code:
>>> RDD
I believe it's an illegal cast. This is the line of code:
> RDD> windowed =
> RDDFunctions.fromRDD(vals.rdd(), vals.classTag()).sliding(20, 1);
with vals being a JavaRDD. Explicitly casting
doesn't work either:
> RDD> windowed = (RDD>)
>
I would also like it, Mich, please send it through, thanks!
On Thu, 12 May 2016 at 15:14 Alonso Isidoro wrote:
> Me too, send me the guide.
>
> Sent from my iPhone
>
> On 12 May 2016, at 12:11, Ashok Kumar >
Solved:
Call spark-submit with
--driver-memory 512m --driver-java-options
"-Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2
-Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2"
Thanks to:
https://issues.apache.org/jira/browse/SPARK-14367
--
View this
Hi,
I am trying to get the same memory behavior in Spark 1.6 as I had in Spark
1.3 with default settings.
I set
--driver-java-options "-Dspark.memory.useLegacyMode=true
-Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6
-Dspark.storage.unrollFraction=0.2"
in Spark 1.6.
But
this setting could be related.
Would greatly appreciated any advice.
Thanks in advance,
Tom
Hey,
I’m wondering if anyone has run into issues with Spark 1.5 and a FileNotFound
exception with shuffle.index files? It’s been cropping up with very large joins
and aggregations, and causing all of our jobs to fail towards the end. The
memory limit for the executors (we’re running on mesos)
Hi Romi,
Thanks! Could you give me an indication of how much to increase the partitions by?
We’ll take a stab in the dark, the input data is around 5M records (though each
record is fairly small). We’ve had trouble both with DataFrames and RDDs.
Tom.
> On 18 Nov 2015, at 12:04, Romi Kuntsman
Is there anything other than the spark assembly that needs to be in the
classpath? I verified the assembly was built right and its in the classpath
(else nothing would work).
Thanks,
Tom
On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman
<shiva...@eecs.berkeley.edu>
n$fit$2.apply(Pipeline.scala:138) at
org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
Anyone have this working?
Thanks,
Tom
I have the following script in a file named test.R:
library(SparkR)
sc <- sparkR.init(master="yarn-client")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
showDF(df)
sparkR.stop()
q(save="no")
If I submit this with "sparkR test.R" or "R CMD BATCH test.R" or
I am running the following command on a Hadoop cluster to launch Spark shell
with DRA:
spark-shell --conf spark.dynamicAllocation.enabled=true --conf
spark.shuffle.service.enabled=true --conf
spark.dynamicAllocation.minExecutors=4 --conf
spark.dynamicAllocation.maxExecutors=12 --conf
I would like to change the logging level for my application running on a
standalone Spark cluster. Is there an easy way to do that without changing
the log4j.properties on each individual node?
Thanks,
Tom
an issue regarding improvement of the docs? For those of us who are
gaining the experience having such a pointer is very helpful.
Tom
From: Tim Chen <t...@mesosphere.io<mailto:t...@mesosphere.io>>
Date: Thursday, September 10, 2015 at 10:25 AM
To: Tom Waterhouse <tomwa...@cisco.c
r
http://stackoverflow.com/questions/31294515/start-spark-via-mesos
There must be better documentation on how to deploy Spark in Mesos with jobs
able to be deployed in cluster mode.
I can follow up with more specific information regarding my deployment if
necessary.
Tom
)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Thanks,
Tom
d.par"
define my table columns
)
Is something like that possible, does that make any sense?
Thanks
Tom
Thanks for your reply Aniket.
Ok I've done this and I'm still confused. Output from running locally
shows:
file:/home/tom/spark-avro/target/scala-2.10/simpleapp.jar
file:/home/tom/spark-1.4.0-bin-hadoop2.4/conf/
file:/home/tom/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar
to not use HDFS)
* Bonus question: Should I use a different API to get a better performance?
Thanks for any responses!
Tom Hubregtsen
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/50-performance-decrease-when-using-local-file-vs-hdfs-tp23987.html
Sent from
?
Thanks in advance,
Tom Hubregtsen
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Info-from-the-event-timeline-appears-to-contradict-dstat-info-tp23862.html
metrics will someday be included in the Hadoop FileStatistics
API. In the meantime, it is not currently possible to understand how much of
a Spark task's time is spent reading from disk via HDFS.
That said, this might be posted as a footnote at the event timeline to avoid
confusion :)
Best regards,
Tom
I believe that as you are not persisting anything into the memory space
defined by
spark.storage.memoryFraction
you also have nothing to clear from this area using the unpersist.
FYI: The data will be kept in the OS-buffer/on disk at the point of the
reduce (as this involves a wide dependency -
is only available on pairRDDs, this might have something to do with it..)
I am using the spark master branch. The error:
[error]
/home/th/spark-1.5.0/spark/IBM_ARL_teraSort_v4-01/src/main/scala/IBM_ARL_teraSort.scala:107:
value partitionBy is not a member of org.apache.spark.sql.DataFrame
Thanks,
Tom
implemented in
dataFrames (?) and makes me wonder if I then should just use dataFrames in
my regular computation.
Thanks in advance,
Tom
P.S. currently using the master branch from the gitHub
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/DataFrames-for-non
at 12:05 PM Tom Seddon mr.tom.sed...@gmail.com wrote:
Hi,
I've worked out how to use explode on my input avro dataset with the
following structure
root
|-- pageViewId: string (nullable = false)
|-- components: array (nullable = true)
||-- element: struct (containsNull = false
Hi,
I've worked out how to use explode on my input avro dataset with the
following structure
root
|-- pageViewId: string (nullable = false)
|-- components: array (nullable = true)
||-- element: struct (containsNull = false)
|||-- name: string (nullable = false)
|||--
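The explode behaviour on a schema like the one above can be sketched without Spark: each element of the components array becomes its own output row, with pageViewId repeated. A plain-Python analogue (field names taken from the schema in the post, row values invented for illustration):

```python
# Rows shaped like the avro schema above: a pageViewId plus an array
# of component structs. Values here are made up for the example.
rows = [
    {"pageViewId": "pv1", "components": [{"name": "header"}, {"name": "footer"}]},
    {"pageViewId": "pv2", "components": [{"name": "body"}]},
]

# explode(): one output row per array element, parent fields repeated.
exploded = [
    {"pageViewId": r["pageViewId"], "component": c}
    for r in rows
    for c in r["components"]
]

print(len(exploded))                     # 3 rows: one per array element
print(exploded[0]["component"]["name"])  # header
```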
Thanks for the responses.
Try removing toDebugString and see what happens.
The toDebugString is performed after [d] (the action), as [e]. By then all
stages are already executed.
--
View this message in context:
]), and with larger input set can also take
a noticeable time. Does anybody have any idea what is running in this
Job/stage 0?
Thanks,
Tom Hubregtsen
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Extra-stage-that-executes-before-triggering-computation
I'm not sure, but I wonder if because you are using the Spark REPL that it
may not be representing what a normal runtime execution would look like and
is possibly eagerly running a partial DAG once you define an operation that
would cause a shuffle.
What happens if you setup your same set of
Thank you for your response Ewan. I quickly looked yesterday and it was
there, but today at work I tried to open it again to start working on it,
but it appears to be removed. Is this correct?
Thanks,
Tom
On 12 April 2015 at 06:58, Ewan Higgs ewan.hi...@ugent.be wrote:
Hi all.
The code
Thanks,
Tom
P.S. (I know that the data might not end up being uniformly distributed,
example: 4 elements in part-0 and 2 in part-1)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/sortByKey-with-multiple-partitions-tp22426.html
source code.
My question:
Could you guys please make the source code of the used TeraSort program,
preferably with settings, available? If not, what are the reasons that this
seems to be withheld?
Thanks for any help,
Tom Hubregtsen
[1]
https://github.com/rxin/spark/commit
We verified it runs on x86, and are now trying to run it on powerPC. We
currently run into dependency trouble with sbt. I tried installing sbt by
hand and resolving all dependencies by hand, but must have made an error, as
I still get errors.
Original error:
Getting org.scala-sbt sbt 0.13.6 ...
you can
use ~ there - IIRC it does not do any kind of variable expansion.
On Mon, Mar 30, 2015 at 3:50 PM, Tom thubregt...@gmail.com wrote:
I have set
spark.eventLog.enabled true
as I try to preserve log files. When I run, I get
Log directory /tmp/spark-events does not exist.
I set
by
hduser. I even performed chmod 777, but Spark keeps on crashing when I run
with spark.eventLog.enabled. It works without. Any hints?
Thanks,
Tom
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-events-does-not-exist-error-while-it-does-with-all-the-req
?
(It always helps to show the command line you're actually running, and
if there's an exception, the first few frames of the stack trace.)
On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen thubregt...@gmail.com
wrote:
Updated spark-defaults and spark-env:
Log directory /home/hduser/spark/spark-events
listed in the error message (i, ii), created a text file, closed it and
viewed it, and deleted it (iii). My findings were reconfirmed by my
colleague. Any other ideas?
Thanks,
Tom
On 30 March 2015 at 19:19, Marcelo Vanzin van...@cloudera.com wrote:
So, the error below is still showing the invalid
$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294)
...
sqlCtx.tables()
DataFrame[tableName: string, isTemporary: boolean]
exit()
~ cat /tmp/test10/part-0
{"key":0,"value":0}
{"key":1,"value":1}
{"key":2,"value":2}
{"key":3,"value":3}
{"key":4,"value":4}
{"key":5,"value":5}
Kind Regards,
Tom
On 27 March
to expect that Spark create an external table in this case? What
is the expected behaviour of saveAsTable with the path option?
Setup: running spark locally with spark 1.3.0.
Kind Regards,
Tom
Another follow-up: saveAsTable works as expected when running on hadoop
cluster with Hive installed. It's just locally that I'm getting this
strange behaviour. Any ideas why this is happening?
Kind Regards.
Tom
On 27 March 2015 at 11:29, Tom Walwyn twal...@gmail.com wrote:
We can set a path
paragraph about Broadcast Variables, I read: "The value is sent to
each node only once, using an efficient, BitTorrent-like communication
mechanism."
- Is the book talking about the proposed BTB from the paper?
- Is this currently the default?
- If not, what is?
Thanks,
Tom
--
View
.pdf.
It is expected to scale sub-linearly; i.e., O(log N), where N is the
number of machines in your cluster.
We evaluated up to 100 machines, and it does follow O(log N) scaling.
--
Mosharaf Chowdhury
http://www.mosharaf.com/
On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen thubregt
Thanks Mosharaf, for the quick response! Can you maybe give me some
pointers to an explanation of this strategy? Or elaborate a bit more on it?
Which parts are involved in which way? Where are the time penalties and how
scalable is this implementation?
Thanks again,
Tom
On 11 March 2015 at 16
message, I see
while (read < TeraInputFormat.RECORD_LEN) {
- Is it possible that this restricts the branch from running on a cluster?
- Did anybody manage to run this branch on a cluster?
Thanks,
Tom
15/02/25 17:55:42 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
arlab152
The SparkConf doesn't allow you to set arbitrary variables. You can use
SparkContext's HadoopRDD and create a JobConf (with whatever variables you
want), and then grab them out of the JobConf in your RecordReader.
On Sun, Feb 22, 2015 at 4:28 PM, hnahak harihar1...@gmail.com wrote:
Hi,
I
Rashid iras...@cloudera.com wrote:
Hi Tom,
there are a couple of things you can do here to make this more efficient.
first, I think you can replace your self-join with a groupByKey. on your
example data set, this would give you
(1, Iterable(2,3))
(4, Iterable(3))
this reduces the amount
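The groupByKey suggestion above can be sketched in plain Python (no Spark), using the same example pairs from the thread: grouping replaces the self-join and directly yields (1, Iterable(2,3)) and (4, Iterable(3)).

```python
# Plain-Python sketch of replacing a self-join with groupByKey:
# group the (key, value) pairs so each key maps to all its values.
from collections import defaultdict

pairs = [(1, 2), (1, 3), (4, 3)]  # the example data set from the thread

groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

print(sorted(groups.items()))  # [(1, [2, 3]), (4, [3])]
```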
))
Thanks
Best Regards
On Wed, Feb 18, 2015 at 12:21 PM, Tom Walwyn twal...@gmail.com wrote:
Hi All,
I'm a new Spark (and Hadoop) user and I want to find out if the cluster
resources I am using are feasible for my use-case. The following is a
snippet of code that is causing an OOM exception
Hi All,
I'm a new Spark (and Hadoop) user and I want to find out if the cluster
resources I am using are feasible for my use-case. The following is a
snippet of code that is causing an OOM exception in the executor after about
125/1000 tasks during the map stage.
val rdd2 = rdd.join(rdd,
Hi,
I've searched but can't seem to find a PySpark example. How do I write
compressed text file output to S3 using PySpark saveAsTextFile?
Thanks,
Tom
I'm trying to set up a PySpark ETL job that takes in JSON log files and
spits out fact table files for upload to Redshift. Is there an efficient
way to send different event types to different outputs without having to
just read the same cached RDD twice? I have my first RDD which is just a
json
)
.set("spark.driver.memory", "26")
.set("spark.storage.memoryFraction", "1")
.set("spark.core.connection.ack.wait.timeout", "6000")
.set("spark.akka.frameSize", "50")
Thanks,
Tom
On 24 October 2014 12:31, htailor hemant.tai...@live.co.uk wrote:
Hi All,
I am relatively new to spark and currently having
Yes please can you share. I am getting this error after expanding my
application to include a large broadcast variable. Would be good to know if
it can be fixed with configuration.
On 23 October 2014 18:04, Michael Campbell michael.campb...@gmail.com
wrote:
Can you list what your fix was so
Hi,
I am trying to call some c code, let's say the compiled file is /path/code,
and it has chmod +x. When I call it directly, it works. Now I want to call
it from Spark 1.1. My problem is not building it into Spark, but making sure
Spark can find it.
I have tried:
; permission issues if I try?
Again, I searched the archives but didn't see any of this, but I'm just getting
started so may very well
be missing this somewhere.
Thanks!
Tom
-benchmark/pavlo/text/tiny/crawl)
dataset.saveAsTextFile(/home/tom/hadoop/bigDataBenchmark/test/crawl3.txt)
If you want to do this more often, or use it directly from the cloud instead
of from local (which will be slower), you can add these keys to
./conf/spark-env.sh
--
View this message in context
From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
reduce them, and get something like Tuple2<Integer, List<Integer>>.
The only way I found to do this was by treating all variables as String, and
in the reduceByKey do
return a._2 + "," + b._2 //in which both are numeric
Is it possible to generate a JavaPairRDD<String, Integer> from a
JavaPairRDD<String, String>, where I can also use the key values? I have
looked at for instance mapToPair, but this generates a new K/V pair based on
the original value, and does not give me information about the key.
I need this in the
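For what it's worth, the mapping function in a pair-RDD map does receive the whole (key, value) tuple, so the key is available when computing the new value. A plain-Python sketch of the idea (invented data, not the Spark API):

```python
# Mapping (key, value) pairs to (key, new_value) while still reading
# the key - the analogue of Spark's mapToPair, whose function receives
# the full (key, value) tuple.
pairs = [("a", "1"), ("b", "2"), ("a", "3")]

# The key is visible inside the mapping; here it influences the value.
mapped = [(k, int(v) * (10 if k == "a" else 1)) for k, v in pairs]

print(mapped)  # [('a', 10), ('b', 2), ('a', 30)]
```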
Hi,
I would like to create multiple key-value pairs, where all keys can still be
reduced. For instance, I have the following 2 lines:
A,B,C
B,D
I would like to return the following pairs for the first line:
A,B
A,C
B,A
B,C
C,A
C,B
And for the second
B,D
D,B
After a reduce by key, I want to end
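The pair-emission step described above (every ordered pair of distinct elements per line, then a reduce by key) can be sketched in plain Python using the two example lines from the post:

```python
# Emit every ordered pair of distinct elements from each line, then
# group by key - a plain-Python analogue of flatMap + reduceByKey.
from itertools import permutations
from collections import defaultdict

lines = ["A,B,C", "B,D"]  # the two example lines from the post

pairs = [
    (a, b)
    for line in lines
    for a, b in permutations(line.split(","), 2)
]
print(pairs[:3])  # [('A', 'B'), ('A', 'C'), ('B', 'A')]

grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)
print(sorted(grouped.items()))
```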
files/rdd's would be a
bonus!
Thanks in advance,
Tom
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-make-sense-of-the-actual-executed-code-tp11594.html