Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay that was some caching issue. Now there is a shared mount point between the place the pyspark code is executed and the spark nodes it runs. Hrmph, I was hoping that wouldn't be the case. Fair enough! On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote: > Okay interesting, maybe my assumpt

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0 so what is /data/hive even referring to when I print out the spark conf values and neither now refer to /data/hive/ On Thu, Mar 7, 2024 at 9:49 PM Tom Barber wrote: > Wonder if anyone can just sort my brain out h

Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Wonder if anyone can just sort my brain out here as to what's possible or not. I have a container running Spark, with Hive and a ThriftServer. I want to run code against it remotely. If I take something simple like this from pyspark.sql import SparkSession from pyspark.sql.types import
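
For context, a minimal sketch of the kind of remote connection being attempted, shown in Scala (the thread uses PySpark, but the builder takes the same config keys in both). Host names, ports and paths are placeholders; the resolution later in the thread was that the warehouse/metastore paths must resolve to the same storage on the client and on the Spark nodes, e.g. via a shared mount.

```scala
import org.apache.spark.sql.SparkSession

// All host names and paths below are placeholders.
val spark = SparkSession.builder()
  .appName("remote-tables")
  .master("spark://spark-container:7077")                         // remote Spark master
  .config("hive.metastore.uris", "thrift://spark-container:9083") // remote Hive metastore
  .config("spark.sql.warehouse.dir", "/shared/hive/warehouse")    // must be visible to client and nodes
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS accounts (id INT, name STRING)")
spark.sql("SHOW TABLES").show()
```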

Unsubscribe

2023-09-11 Thread Tom Praison
Unsubscribe

Re: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-04 Thread Tom Graves
zes for your? Tom On Thursday, November 3, 2022 at 03:18:07 PM CDT, Shay Elbaz wrote: This is exactly what we ended up doing! The only drawback I saw with this approach is that the GPU tasks get pretty big (in terms of data and compute tim

Re: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread Tom Graves
, before that mapPartitions you could do a repartition if necessary to get to exactly the number of tasks you want (20).  That way even if maxExecutors=500 you will only ever need 20 or whatever you repartition to and spark isn't going to ask for more then that. Tom On Thursday, November 3
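
A minimal sketch of the pattern described above, assuming a spark-shell `sc`: repartition to exactly the number of tasks you want before the expensive mapPartitions, so the stage never asks for more executors than that. The input RDD and the per-partition work are placeholders.

```scala
// Placeholder input; in the thread this is the stage feeding the GPU work.
val input = sc.parallelize(1 to 1000000)

// Cap the expensive stage at exactly 20 tasks, regardless of
// spark.dynamicAllocation.maxExecutors.
val result = input
  .repartition(20)
  .mapPartitions { iter =>
    iter.map(_ * 2) // placeholder for the real per-partition (e.g. GPU) work
  }
```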

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

2021-08-05 Thread Tom Graves
task would use the GPU and the other could just use the CPU.  Perhaps that is too simplistic or brittle though. Tom On Saturday, July 31, 2021, 03:56:18 AM CDT, Andreas Kunft wrote: I have a setup with two work intensive tasks, one map using GPU followed by a map using only CPU. Using s
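
For reference, a hedged sketch of the stage-level scheduling API (Spark 3.1+) the thread is discussing, assuming a spark-shell `sc`. A resource profile attached to an RDD applies to the whole stage that RDD lands in, which is exactly why two consecutive narrow maps (GPU then CPU) are hard to schedule separately without a stage boundary between them. The discovery script path and resource amounts are placeholders.

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

def gpuStep(x: Int): Int = x * 2 // placeholder for the GPU-heavy map
def cpuStep(x: Int): Int = x + 1 // placeholder for the CPU-only follow-up map

// Request one GPU per task (script path and amounts are placeholders).
val gpuProfile = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests().cores(2).resource("gpu", 1, "/opt/getGpus.sh", "nvidia.com"))
  .require(new TaskResourceRequests().resource("gpu", 1))
  .build()

val input = sc.parallelize(1 to 100, 20)

// With no shuffle between them, both maps fall into the same stage and
// share this profile -- the limitation the thread is about.
val out = input.withResources(gpuProfile).map(gpuStep).map(cpuStep)
```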

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
Looks like repartitioning was my friend, seems to be distributed across the cluster now. All good. Thanks! On Wed, Jun 23, 2021 at 2:18 PM Tom Barber wrote: > Okay so I tried another idea which was to use a real simple class to drive > a mapPartitions... because logic in my head

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
b) how it divides up partitions to tasks c) the fact its a POJO and not a file of stuff. Or probably some of all 3. Tom On Wed, Jun 23, 2021 at 11:44 AM Tom Barber wrote: > (I should point out that I'm diagnosing this by looking at the active > tasks https://pasteboard.co/K7VryDJ.png, if

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
(I should point out that I'm diagnosing this by looking at the active tasks https://pasteboard.co/K7VryDJ.png, if I'm reading it incorrectly, let me know) On Wed, Jun 23, 2021 at 11:38 AM Tom Barber wrote: > Uff hello fine people. > > So the cause of the above issue was, unsur

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
how to split that flatmap operation up so the RDD processing runs across the nodes, not limited to a single node? Thanks for all your help so far, Tom On Wed, Jun 9, 2021 at 8:08 PM Tom Barber wrote: > Ah no sorry, so in the load image, the crawl has just kicked off on the > driver node which

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
. Tom On Wed, Jun 9, 2021 at 8:03 PM Sean Owen wrote: > Where do you see that ... I see 3 executors busy at first. If that's the > crawl then ? > > On Wed, Jun 9, 2021 at 1:59 PM Tom Barber wrote: > >> Yeah :) >> >> But it's all running through the same

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
rst place? > > On Wed, Jun 9, 2021 at 1:49 PM Tom Barber wrote: > >> Yeah but that something else is the crawl being run, which is triggered >> from inside the RDDs, because the log output is slowly outputting crawl >> data. >> >> -- Spicule Limited is reg

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
g else on the driver - not doing everything on 1 machine. > > On Wed, Jun 9, 2021 at 12:43 PM Tom Barber wrote: > >> And also as this morning: https://pasteboard.co/K5Q9aEf.png >> >> Removing the cpu pins gives me more tasks but as you can see here: >> >> https://pas

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
med. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Wed, 9 Jun 2021 at 18:43, Tom Barber wrote: > >> And also as this morning: https://pasteboard.co/K5Q9aEf.png >> >> Removing the

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
And also as this morning: https://pasteboard.co/K5Q9aEf.png Removing the cpu pins gives me more tasks but as you can see here: https://pasteboard.co/K5Q9GO0.png It just loads up a single server. On Wed, Jun 9, 2021 at 6:32 PM Tom Barber wrote: > Thanks Chris > > All the co

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
se checks out. I'll poke around in the other hints you suggested later, thanks for the help. Tom On Wed, Jun 9, 2021 at 5:49 PM Chris Martin wrote: > Hmm then my guesses are (in order of decreasing probability: > > * Whatever class makes up fetchedRdd (MemexDeepCrawlDbRDD?) isn't > compati

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
.getGroup, r)) > > how many distinct groups do you ended up with? If there's just one then I > think you might see the behaviour you observe. > > Chris > > > On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote: > >> Also just to follow up on that slightly, I di

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
ent] = repRdd.map(d => ScoreUpdateSolrTransformer(d)) I did that, but the crawl is executed in that repartition executor (which I should have pointed out I already know). Tom On Wed, Jun 9, 2021 at 4:37 PM Tom Barber wrote: > Sorry Sam, I missed that earlier, I'll give it a spin. > >

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
ache() > repRdd.take(1) > Then map operation on repRdd here. > > I’ve done similar map operations in the past and this works. > > Thanks. > > On Wed, Jun 9, 2021 at 11:17 AM Tom Barber wrote: > >> Also just to follow up on that slightly, I did also try off the back

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
RDD[SolrInputDocument] = scoredRdd.repartition(50).map(d => ScoreUpdateSolrTransformer(d)) Where I repartitioned that scoredRdd map out of interest, it then triggers the FairFetcher function there, instead of in the runJob(), but still on a single executor  Tom On Wed, Jun 9, 2021 at 4:11 PM Tom Barber

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
teRdd, scoreUpdateFunc) When it's doing stuff in the Spark UI I can see that it's waiting on the sc.runJob() line, so that's the execution point. Tom On Wed, Jun 9, 2021 at 3:59 PM Sean Owen wrote: > persist() doesn't even persist by itself - just sets it to be persisted > when it's execute

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
the tasks. Is that not the case? Thanks Tom On Wed, Jun 9, 2021 at 3:44 PM Mich Talebzadeh wrote: > Hi Tom, > > Persist() here simply means persist to memory). That is all. You can check > UI tab on storage > > > https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persi

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
; I think we need more info about what else is happening in the code. > > On Wed, Jun 9, 2021 at 6:30 AM Tom Barber wrote: > >> Yeah so if I update the FairFetcher to return a seq it makes no real >> difference. >> >> Here's an image of what I'm seeing just for r

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
bfs:/FileStore/bcf/sparkler7.jar","crawl","-id","mytestcrawl11", "-tn", "5000", "-co", "{\"plugins.active\":[\"urlfilter-regex\",\"urlfilter-samehost\",\"fetcher-chrome\"],\"plugins\&

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
I've not run it yet, but I've stuck a toSeq on the end, but in reality a Seq just inherits Iterator, right? Flatmap does return a RDD[CrawlData] unless my IDE is lying to me. Tom On Wed, Jun 9, 2021 at 10:54 AM Tom Barber wrote: > Interesting Jayesh, thanks, I will test. > > All

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Interesting Jayesh, thanks, I will test. All this code is inherited and it runs, but I don't think it's been tested in a distributed context for about 5 years, but yeah I need to get this pushed down, so I'm happy to try anything! :) Tom On Wed, Jun 9, 2021 at 3:37 AM Lalwani, Jayesh wrote

Re: Distributing a FlatMap across a Spark Cluster

2021-06-08 Thread Tom Barber
For anyone interested here's the execution logs up until the point where it actually kicks off the workload in question: https://gist.github.com/buggtb/a9e0445f24182bc8eedfe26c0f07a473 On 2021/06/09 01:52:39, Tom Barber wrote: > ExecutorID says driver, and looking at the IP addresses

Re: Distributing a FlatMap across a Spark Cluster

2021-06-08 Thread Tom Barber
> how many partitions does the groupByKey produce? that would limit your > parallelism no matter what if it's a small number. > > On Tue, Jun 8, 2021 at 8:07 PM Tom Barber wrote: > > > Hi folks, > > > > Hopefully someone with more Spark experience than me can ex

Distributing a FlatMap across a Spark Cluster

2021-06-08 Thread Tom Barber
ecause the processing of the data in the RDD isn't the bottleneck, the fetching of the crawl data is the bottleneck, but that happens after the code has been assigned to a node. Thanks Tom - To unsubscribe e-mail: user-un

Re: GPU job in Spark 3

2021-04-09 Thread Tom Graves
it didn't run on the GPU is to enable the config:  spark.rapids.sql.explain=NOT_ON_GPU It will print out logs to your console as to why different operators don't run on the gpu.   Again feel free to open up a question issues in the spark-rapids repo and we can discuss more there. Tom On Friday
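
The config called out above can also be set on a running session; a one-line sketch, assuming `spark` is an active SparkSession with the RAPIDS plugin loaded:

```scala
// Ask the RAPIDS plugin to log why operators stayed on the CPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")
```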

Re: [Spark Core] makeRDD() preferredLocations do not appear to be considered

2020-09-12 Thread Tom Scott
" etc. <https://stackoverflow.com/users/14147688/tom-scott> On Tue, Sep 8, 2020 at 10:11 PM Tom Scott wrote: > Hi Guys, > > I asked this in stack overflow here: > https://stackoverflow.com/questions/63535720/why-would-preferredlocations-not-be-enforced-on-an-empty-s

[Spark Core] makeRDD() preferredLocations do not appear to be considered

2020-09-08 Thread Tom Scott
ee things like: scala> someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println) 1:worker3 2:worker1 3:worker2 scala> someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println) 1:worker2 2:worker3 3:worker1 Am I doing this wrong or is this expected behaviour? Thanks Tom
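
For reference, a sketch of the makeRDD overload this thread is about, where each element carries its preferred hosts; the worker host names are placeholders matching the output shown above, and `sc` is the spark-shell SparkContext.

```scala
// Each element is paired with the hosts Spark should prefer for it.
val someRdd = sc.makeRDD(Seq(
  (1, Seq("worker1")),
  (2, Seq("worker2")),
  (3, Seq("worker3"))
))

// Same check as in the message: where did each element actually run?
someRdd
  .map(i => i + ":" + java.net.InetAddress.getLocalHost().getHostName())
  .collect()
  .foreach(println)
```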

Re: Build customized resource manager

2019-11-08 Thread Tom Graves
I don't know if it all works but some work was done to make cluster manager pluggable, see SPARK-13904. Tom On Wednesday, November 6, 2019, 07:22:59 PM CST, Klaus Ma wrote: Any suggestions? - Klaus On Mon, Nov 4, 2019 at 5:04 PM Klaus Ma wrote: Hi team, AFAIK, we built k8s/yarn

[ANNOUNCE] Apache Spark 2.2.2

2018-07-10 Thread Tom Graves
We are happy to announce the availability of Spark 2.2.2! Apache Spark 2.2.2 is a maintenance release, based on the branch-2.2 maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade to this stable release. The release notes are available at 

Re: Streaming - lookup against reference data

2016-09-15 Thread Tom Davis
Thanks Jörn, sounds like there's nothing obvious I'm missing, which is encouraging. I've not used Redis, but it does seem that for most of my current and likely future use-cases it would be the best fit (nice compromise of scale and easy setup / access). Thanks, Tom On Wed, Sep 14, 2016 at 10

Streaming - lookup against reference data

2016-09-14 Thread Tom Davis
the cluster. I guess there's no solution that fits all, but interested in other people's experience and whether I've missed anything obvious. Thanks, Tom

Unresponsive Spark Streaming UI in YARN cluster mode - 1.5.2

2016-07-08 Thread Ellis, Tom (Financial Markets IT)
) org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79) org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136) org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) Cheers, Tom Ellis Consultant Developer

In yarn-cluster mode, provide system prop to the client jvm

2016-06-16 Thread Ellis, Tom (Financial Markets IT)
not have access? Cheers, Tom Ellis Consultant Developer - Excelian Data Lake | Financial Markets IT LLOYDS BANK COMMERCIAL BANKING E: tom.el...@lloydsbanking.com Website: www.lloydsbankcommercial.co

RE: HBase / Spark Kerberos problem

2016-05-19 Thread Ellis, Tom (Financial Markets IT)
the source of Client [1] and YarnSparkHadoopUtil [2] – you’ll see how obtainTokenForHBase is being done. It’s a bit confusing as to why it says you haven’t kinited even when you do loginUserFromKeytab – I haven’t quite worked through the reason for that yet. Cheers, Tom Ellis telli...@gmail.com

Re: Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread Tom Godden
pache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions > > An RDD of T produces an RDD of T[]. > > On Fri, May 13, 2016 at 12:10 PM, Tom Godden <tgod...@vub.ac.be> wrote: >> I assumed the "fixed size blocks" mentioned in the documentation >&g

Re: Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread Tom Godden
re. The return type is an RDD of > arrays, not of RDDs or of ArrayLists. There may be another catch but > that is not it. > > On Fri, May 13, 2016 at 11:50 AM, Tom Godden <tgod...@vub.ac.be> wrote: >> I believe it's an illegal cast. This is the line of code: >>> RDD

Re: Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread Tom Godden
I believe it's an illegal cast. This is the line of code: > RDD> windowed = > RDDFunctions.fromRDD(vals.rdd(), vals.classTag()).sliding(20, 1); with vals being a JavaRDD. Explicitly casting doesn't work either: > RDD> windowed = (RDD>) >
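
A sketch of the call in question, bound to the type the replies point at: sliding() over an RDD[T] yields an RDD of arrays (RDD[Array[T]] in Scala, an RDD of T[] from Java), not an RDD of ArrayLists. `sc` is assumed to be the spark-shell SparkContext.

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

val vals = sc.parallelize(1 to 100)

// Window of 20 elements, step 1, as in the original snippet.
val windowed: RDD[Array[Int]] = vals.sliding(20, 1)

windowed.take(2).foreach(w => println(w.mkString(",")))
```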

Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Tom Ellis
I would like to also Mich, please send it through, thanks! On Thu, 12 May 2016 at 15:14 Alonso Isidoro wrote: > Me too, send me the guide. > > Enviado desde mi iPhone > > El 12 may 2016, a las 12:11, Ashok Kumar >

Re: Using spark.memory.useLegacyMode true does not yield expected behavior

2016-04-11 Thread Tom Hubregtsen
Solved: Call spark-submit with --driver-memory 512m --driver-java-options "-Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2" Thanks to: https://issues.apache.org/jira/browse/SPARK-14367 -- View this

Using spark.memory.useLegacyMode true does not yield expected behavior

2016-03-29 Thread Tom Hubregtsen
Hi, I am trying to get the same memory behavior in Spark 1.6 as I had in Spark 1.3 with default settings. I set --driver-java-options "--Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2" in Spark 1.6. But

What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Tom Seddon
this setting could be related. Would greatly appreciated any advice. Thanks in advance, Tom

Shuffle FileNotFound Exception

2015-11-18 Thread Tom Arnfeld
Hey, I’m wondering if anyone has run into issues with Spark 1.5 and a FileNotFound exception with shuffle.index files? It’s been cropping up with very large joins and aggregations, and causing all of our jobs to fail towards the end. The memory limit for the executors (we’re running on mesos)

Re: Shuffle FileNotFound Exception

2015-11-18 Thread Tom Arnfeld
Hi Romi, Thanks! Could you give me an indication of how much to increase the partitions by? We’ll take a stab in the dark, the input data is around 5M records (though each record is fairly small). We’ve had trouble both with DataFrames and RDDs. Tom. > On 18 Nov 2015, at 12:04, Romi Kuntsman

Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-11 Thread Tom Graves
Is there anything other then the spark assembly that needs to be in the classpath?  I verified the assembly was built right and its in the classpath (else nothing would work). Thanks,Tom On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu>

anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-06 Thread Tom Graves
n$fit$2.apply(Pipeline.scala:138)        at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134) Anyone have this working? Thanks,Tom

sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found

2015-10-30 Thread Tom Stewart
I have the following script in a file named test.R: library(SparkR) sc <- sparkR.init(master="yarn-client") sqlContext <- sparkRSQL.init(sc) df <- createDataFrame(sqlContext, faithful) showDF(df) sparkR.stop() q(save="no") If I submit this with "sparkR test.R" or "R  CMD BATCH test.R" or

Spark 1.5.1 Dynamic Resource Allocation

2015-10-30 Thread Tom Stewart
I am running the following command on a Hadoop cluster to launch Spark shell with DRA: spark-shell  --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=4 --conf spark.dynamicAllocation.maxExecutors=12 --conf

Changing application log level in standalone cluster

2015-10-13 Thread Tom Graves
I would like to change the logging level for my application running on a standalone Spark cluster.  Is there an easy way to do that  without changing the log4j.properties on each individual node? Thanks,Tom

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-11 Thread Tom Waterhouse (tomwater)
an issue regarding improvement of the docs? For those of us who are gaining the experience having such a pointer is very helpful. Tom From: Tim Chen <t...@mesosphere.io> Date: Thursday, September 10, 2015 at 10:25 AM To: Tom Waterhouse <tomwa...@cisco.c

Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-10 Thread Tom Waterhouse (tomwater)
r http://stackoverflow.com/questions/31294515/start-spark-via-mesos There must be better documentation on how to deploy Spark in Mesos with jobs able to be deployed in cluster mode. I can follow up with more specific information regarding my deployment if necessary. Tom

java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Tom Seddon
) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Thanks, Tom

Help getting Spark JDBC metadata

2015-09-09 Thread Tom Barber
d.par" define my table columns ) Is something like that possible, does that make any sense? Thanks Tom

Re: java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Tom Seddon
Thanks for your reply Aniket. Ok I've done this and I'm still confused. Output from running locally shows: file:/home/tom/spark-avro/target/scala-2.10/simpleapp.jar file:/home/tom/spark-1.4.0-bin-hadoop2.4/conf/ file:/home/tom/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar

50% performance decrease when using local file vs hdfs

2015-07-24 Thread Tom Hubregtsen
to not use HDFS) * Bonus question: Should I use a different API to get a better performance? Thanks for any responses! Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/50-performance-decrease-when-using-local-file-vs-hdfs-tp23987.html Sent from

Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
? Thanks in advance, Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Info-from-the-event-timeline-appears-to-contradict-dstat-info-tp23862.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
metrics will someday be included in the Hadoop FileStatistics API. In the meantime, it is not currently possible to understand how much of a Spark task's time is spent reading from disk via HDFS. That said, this might be posted as a footnote at the event timeline to avoid confusion :) Best regards, Tom

Re: Un-persist RDD in a loop

2015-06-23 Thread Tom Hubregtsen
I believe that as you are not persisting anything into the memory space defined by spark.storage.memoryFraction you also have nothing to clear from this area using the unpersist. FYI: The data will be kept in the OS-buffer/on disk at the point of the reduce (as this involves a wide dependency -

PartitionBy/Partitioner for dataFrames?

2015-06-21 Thread Tom Hubregtsen
is only available on pairRDD's, this might have something to do with it..) I am using the spark master branch. The error: [error] /home/th/spark-1.5.0/spark/IBM_ARL_teraSort_v4-01/src/main/scala/IBM_ARL_teraSort.scala:107: value partitionBy is not a member of org.apache.spark.sql.DataFrame Thanks, Tom

DataFrames for non-SQL computation?

2015-06-11 Thread Tom Hubregtsen
implemented in dataFrames (?) and makes me wonder if I then should just use dataFrames in my regular computation. Thanks in advance, Tom P.S. currently using the master branch from the gitHub -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrames-for-non

Re: SparkSQL DF.explode with Nulls

2015-06-05 Thread Tom Seddon
at 12:05 PM Tom Seddon mr.tom.sed...@gmail.com wrote: Hi, I've worked out how to use explode on my input avro dataset with the following structure root |-- pageViewId: string (nullable = false) |-- components: array (nullable = true) ||-- element: struct (containsNull = false

SparkSQL DF.explode with Nulls

2015-06-04 Thread Tom Seddon
Hi, I've worked out how to use explode on my input avro dataset with the following structure root |-- pageViewId: string (nullable = false) |-- components: array (nullable = true) ||-- element: struct (containsNull = false) |||-- name: string (nullable = false) |||--

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
Thanks for the responses. Try removing toDebugString and see what happens. The toDebugString is performed after [d] (the action), as [e]. By then all stages are already executed. -- View this message in context:

Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
]), and with larger input set can also take a noticeable time. Does anybody have any idea what is running in this Job/stage 0? Thanks, Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Extra-stage-that-executes-before-triggering-computation

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
I'm not sure, but I wonder if because you are using the Spark REPL that it may not be representing what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle. What happens if you setup your same set of

Re: Spark TeraSort source request

2015-04-13 Thread Tom Hubregtsen
Thank you for your response Ewan. I quickly looked yesterday and it was there, but today at work I tried to open it again to start working on it, but it appears to be removed. Is this correct? Thanks, Tom On 12 April 2015 at 06:58, Ewan Higgs ewan.hi...@ugent.be wrote: Hi all. The code

sortByKey with multiple partitions

2015-04-08 Thread Tom
Thanks, Tom P.S. (I know that the data might not end up being uniformly distributed, example: 4 elements in part-0 and 2 in part-1) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sortByKey-with-multiple-partitions-tp22426.html Sent from the Apache

Spark TeraSort source request

2015-04-03 Thread Tom
source code. My question: Could you guys please make the source code of the used TeraSort program, preferably with settings, available? If not, what are the reasons that this seems to be withheld? Thanks for any help, Tom Hubregtsen [1] https://github.com/rxin/spark/commit

Did anybody run Spark-perf on powerpc?

2015-03-31 Thread Tom
We verified it runs on x86, and are now trying to run it on powerPC. We currently run into dependency trouble with sbt. I tried installing sbt by hand and resolving all dependencies by hand, but must have made an error, as I still get errors. Original error: Getting org.scala-sbt sbt 0.13.6 ...

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
you can use ~ there - IIRC it does not do any kind of variable expansion. On Mon, Mar 30, 2015 at 3:50 PM, Tom thubregt...@gmail.com wrote: I have set spark.eventLog.enabled true as I try to preserve log files. When I run, I get Log directory /tmp/spark-events does not exist. I set

Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom
by hduser. I even performed chmod 777, but Spark keeps on crashing when I run with spark.eventLog.enabled. It works without. Any hints? Thanks, Tom -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-events-does-not-exist-error-while-it-does-with-all-the-req

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
? (It always helps to show the command line you're actually running, and if there's an exception, the first few frames of the stack trace.) On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen thubregt...@gmail.com wrote: Updated spark-defaults and spark-env: Log directory /home/hduser/spark/spark-events

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
listed in the error message (i, ii), created a text file, closed it an viewed it, and deleted it (iii). My findings were reconfirmed by my colleague. Any other ideas? Thanks, Tom On 30 March 2015 at 19:19, Marcelo Vanzin van...@cloudera.com wrote: So, the error below is still showing the invalid

Re: saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Tom Walwyn
$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294) ... sqlCtx.tables() DataFrame[tableName: string, isTemporary: boolean] exit() ~ cat /tmp/test10/part-0 {key:0,value:0} {key:1,value:1} {key:2,value:2} {key:3,value:3} {key:4,value:4} {key:5,value:5} Kind Regards, Tom On 27 March

saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Tom Walwyn
to expect that Spark create an external table in this case? What is the expected behaviour of saveAsTable with the path option? Setup: running spark locally with spark 1.3.0. Kind Regards, Tom
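
A sketch of the call under discussion in the current DataFrameWriter API (the thread itself is on 1.3), assuming `df` is an existing DataFrame and `spark` a Hive-enabled SparkSession; supplying a path is expected to create an external table whose data lives at that location. The path and table name are placeholders taken from the thread.

```scala
// Write the data to an explicit location and register it as a table;
// with a path set, Spark should treat it as an external table.
df.write
  .format("json")
  .option("path", "/tmp/test10")
  .saveAsTable("test10")

spark.sql("DESCRIBE EXTENDED test10").show(truncate = false)
```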

Re: saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Tom Walwyn
Another follow-up: saveAsTable works as expected when running on hadoop cluster with Hive installed. It's just locally that I'm getting this strange behaviour. Any ideas why this is happening? Kind Regards. Tom On 27 March 2015 at 11:29, Tom Walwyn twal...@gmail.com wrote: We can set a path

Which strategy is used for broadcast variables?

2015-03-11 Thread Tom
paragraph about Broadcast Variables, I read The value is sent to each node only once, using an efficient, BitTorrent-like communication mechanism. - Is the book talking about the proposed BTB from the paper? - Is this currently the default? - If not, what is? Thanks, Tom -- View

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
.pdf. It is expected to scale sub-linearly; i.e., O(log N), where N is the number of machines in your cluster. We evaluated up to 100 machines, and it does follow O(log N) scaling. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen thubregt

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Thanks Mosharaf, for the quick response! Can you maybe give me some pointers to an explanation of this strategy? Or elaborate a bit more on it? Which parts are involved in which way? Where are the time penalties and how scalable is this implementation? Thanks again, Tom On 11 March 2015 at 16

Error when running the terasort branche in a cluster

2015-02-25 Thread Tom
message, I see while (read < TeraInputFormat.RECORD_LEN) { - Is it possible that this restricts the branch from running on a cluster? - Did anybody manage to run this branch on a cluster? Thanks, Tom 15/02/25 17:55:42 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, arlab152

Re: How to send user variables from Spark client to custom InputFormat or RecordReader ?

2015-02-22 Thread Tom Vacek
The SparkConf doesn't allow you to set arbitrary variables. You can use SparkContext's HadoopRDD and create a JobConf (with whatever variables you want), and then grab them out of the JobConf in your RecordReader. On Sun, Feb 22, 2015 at 4:28 PM, hnahak harihar1...@gmail.com wrote: Hi, I
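
A sketch of the approach described above, assuming a spark-shell `sc`: put the user variables on a JobConf, build the RDD with hadoopRDD, and have the custom RecordReader read them back from the job configuration it is handed. The key, value, and input path are placeholders.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Carry arbitrary user variables on the Hadoop job configuration.
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("my.custom.variable", "some-value")                  // placeholder key/value
FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/input")  // placeholder path

// A custom InputFormat/RecordReader can read "my.custom.variable"
// back out of the JobConf.
val rdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```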

Re: OutOfMemory and GC limits (TODO) Error in map after self-join

2015-02-18 Thread Tom Walwyn
Rashid iras...@cloudera.com wrote: Hi Tom, there are a couple of things you can do here to make this more efficient. first, I think you can replace your self-join with a groupByKey. on your example data set, this would give you (1, Iterable(2,3)) (4, Iterable(3)) this reduces the amount
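
A sketch of the suggestion quoted above, using the same example data and assuming a spark-shell `sc`: groupByKey collapses the self-join into a single grouping pass and yields (1, Iterable(2,3)) and (4, Iterable(3)).

```scala
// Example data matching the reply.
val rdd = sc.parallelize(Seq((1, 2), (1, 3), (4, 3)))

// Instead of rdd.join(rdd, ...), group the values per key once.
val grouped = rdd.groupByKey()

grouped.collect().foreach { case (k, vs) =>
  println(s"($k, Iterable(${vs.mkString(",")}))") // (1, Iterable(2,3)), (4, Iterable(3))
}
```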

Re: OutOfMemory and GC limits (TODO) Error in map after self-join

2015-02-17 Thread Tom Walwyn
)) Thanks Best Regards On Wed, Feb 18, 2015 at 12:21 PM, Tom Walwyn twal...@gmail.com wrote: Hi All, I'm a new Spark (and Hadoop) user and I want to find out if the cluster resources I am using are feasible for my use-case. The following is a snippet of code that is causing a OOM exception

OutOfMemory and GC limits (TODO) Error in map after self-join

2015-02-17 Thread Tom Walwyn
Hi All, I'm a new Spark (and Hadoop) user and I want to find out if the cluster resources I am using are feasible for my use-case. The following is a snippet of code that is causing a OOM exception in the executor after about 125/1000 tasks during the map stage. val rdd2 = rdd.join(rdd,

PySpark saveAsTextFile gzip

2015-01-15 Thread Tom Seddon
Hi, I've searched but can't seem to find a PySpark example. How do I write compressed text file output to S3 using PySpark saveAsTextFile? Thanks, Tom
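
For reference, the compressed-output call looks like this in Scala; PySpark's saveAsTextFile takes the codec as a class-name string instead (compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"). The bucket path is a placeholder, and `sc` is assumed.

```scala
import org.apache.hadoop.io.compress.GzipCodec

val rdd = sc.parallelize(Seq("line one", "line two"))

// Write gzip-compressed part files to the (placeholder) S3 location.
rdd.saveAsTextFile("s3a://my-bucket/output/", classOf[GzipCodec])
```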

Efficient way to split an input data set into different output files

2014-11-19 Thread Tom Seddon
I'm trying to set up a PySpark ETL job that takes in JSON log files and spits out fact table files for upload to Redshift. Is there an efficient way to send different event types to different outputs without having to just read the same cached RDD twice? I have my first RDD which is just a json
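
One hedged way to get a single-pass split is the DataFrame partitionBy writer (newer than this 2014 thread), which routes each event type into its own sub-directory in one job; the classic RDD-era alternative is a custom MultipleTextOutputFormat. Column names and paths below are placeholders, and `spark` is an active SparkSession.

```scala
// Read the JSON logs once and let the writer fan rows out by event type.
val events = spark.read.json("s3a://logs/raw/")

events.write
  .partitionBy("event_type")       // placeholder column
  .json("s3a://logs/fact-tables/") // placeholder output path
```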

Re: Broadcast failure with variable size of ~ 500mb with key already cancelled ?

2014-11-11 Thread Tom Seddon
) .set(spark.driver.memory, 26). .set(spark.storage.memoryFraction,1) .set(spark.core.connection.ack.wait.timeout,6000) .set(spark.akka.frameSize,50) Thanks, Tom On 24 October 2014 12:31, htailor hemant.tai...@live.co.uk wrote: Hi All, I am relatively new to spark and currently having

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-11-11 Thread Tom Seddon
Yes please can you share. I am getting this error after expanding my application to include a large broadcast variable. Would be good to know if it can be fixed with configuration. On 23 October 2014 18:04, Michael Campbell michael.campb...@gmail.com wrote: Can you list what your fix was so

java.library.path

2014-10-05 Thread Tom
Hi, I am trying to call some c code, let's say the compiled file is /path/code, and it has chmod +x. When I call it directly, it works. Now i want to call it from Spark 1.1. My problem is not building it into Spark, but making sure Spark can find it. I have tried:

Question about addFiles()

2014-10-03 Thread Tom Weber
; permission issues if I try? Again, I searched the archives but didn't see any of this, but I'm just getting started so may very well be missing this somewhere. Thanks! Tom

Re: Retrieve dataset of Big Data Benchmark

2014-09-27 Thread Tom
-benchmark/pavlo/text/tiny/crawl) dataset.saveAsTextFile(/home/tom/hadoop/bigDataBenchmark/test/crawl3.txt) If you want to do this more often, or use it directly from the cloud instead of from local (which will be slower), you can add these keys to ./conf/spark-env.sh -- View this message in context

Reduce Tuple2<Integer, Integer> to Tuple2<Integer, List<Integer>>

2014-09-16 Thread Tom
From my map function I create Tuple2<Integer, Integer> pairs. Now I want to reduce them, and get something like Tuple2<Integer, List<Integer>>. The only way I found to do this was by treating all variables as String, and in the reduceByKey do return a._2 + "," + b._2 // in which both are numeric
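
A sketch of the reshaping asked for, which avoids the string workaround: collect the values per key into a list. Shown in Scala with placeholder data; the Java API exposes the same aggregateByKey (and groupByKey) on JavaPairRDD.

```scala
import org.apache.spark.rdd.RDD

val pairs = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))

// Build a List[Int] per key instead of concatenating strings.
val grouped: RDD[(Int, List[Int])] =
  pairs.aggregateByKey(List.empty[Int])((acc, v) => v :: acc, _ ::: _)

grouped.collect().foreach(println) // e.g. (1,List(20, 10)), (2,List(30))
```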

JavaPairRDD<String, Integer> to JavaPairRDD<String, String> based on key

2014-09-10 Thread Tom
Is it possible to generate a JavaPairRDD<String, Integer> from a JavaPairRDD<String, String>, where I can also use the key values? I have looked at for instance mapToPair, but this generates a new K/V pair based on the original value, and does not give me information about the key. I need this in the
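
For reference, a sketch of the pair-to-pair transformation asked about, with placeholder data: in both the Scala map shown here and the Java mapToPair equivalent, the function receives the whole (key, value) tuple, so the key is available when building the new value.

```scala
import org.apache.spark.rdd.RDD

val counts: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Build the new value from both the key and the old value.
val described: RDD[(String, String)] =
  counts.map { case (key, value) => (key, s"$key -> $value") }

described.collect().foreach(println) // (a,a -> 1), (b,b -> 2)
```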

Return multiple [K,V] pairs from a Java Function

2014-08-24 Thread Tom
Hi, I would like to create multiple key-value pairs, where all keys still can be reduced. For instance, I have the following 2 lines: A,B,C B,D I would like to return the following pairs for the first line: A,B A,C B,A B,C C,A C,B And for the second B,D D,B After a reduce by key, I want to end
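
A sketch of the expansion described, assuming a spark-shell `sc`: each comma-separated line yields every ordered pair of distinct tokens, after which reduceByKey (or groupByKey) works as usual. In the Java API the same step is a flatMapToPair.

```scala
val lines = sc.parallelize(Seq("A,B,C", "B,D"))

// "A,B,C" -> (A,B) (A,C) (B,A) (B,C) (C,A) (C,B); "B,D" -> (B,D) (D,B)
val pairs = lines.flatMap { line =>
  val tokens = line.split(",")
  for {
    a <- tokens
    b <- tokens
    if a != b
  } yield (a, b)
}

pairs.collect().foreach(println)
```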

Trying to make sense of the actual executed code

2014-08-06 Thread Tom
files/rdd's would be a bonus! Thanks in advance, Tom -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-make-sense-of-the-actual-executed-code-tp11594.html Sent from the Apache Spark User List mailing list archive at Nabble.com
