],
in the Amazon cluster. Is there a way I can download this without being a
user of the Amazon cluster? I tried
bin/hadoop distcp s3n://123:456@big-data-benchmark/pavlo/text/tiny/* ./
but it asks for an AWS Access Key ID and Secret Access Key which I do not
have.
Thanks in advance,
Tom
Hi Burak,
Thank you for your pointer, it really helped. I do have some
follow-up questions though.
After looking at the Big Data Benchmark page
https://amplab.cs.berkeley.edu/benchmark/ (Section Run this benchmark
yourself), I was expecting the following combination of files:
Sets:
the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey
properties (respectively).
I guess the files are publicly available, but only to registered AWS users,
so I caved in and registered for the service. Using the credentials that I
got I was able to download the files using the local spark shell.
Thanks!
Tom
that substr is supported by
HiveQL, but not by Spark SQL, correct?
Thanks!
Tom
files/rdd's would be a
bonus!
Thanks in advance,
Tom
Hi,
I would like to create multiple key-value pairs, where all keys can still be
reduced. For instance, I have the following two lines:
A,B,C
B,D
I would like to return the following pairs for the first line:
A,B
A,C
B,A
B,C
C,A
C,B
And for the second
B,D
D,B
After a reduce by key, I want to end
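A minimal sketch of one way to express this in the spark-shell (all names are
illustrative; the input is assumed to be an RDD of comma-separated strings):

val lines = sc.parallelize(Seq("A,B,C", "B,D"))
// Emit every ordered pair of distinct tokens per line as ((a, b), 1),
// so that reduceByKey can aggregate across all lines afterwards.
val pairs = lines.flatMap { line =>
  val tokens = line.split(",")
  for { a <- tokens; b <- tokens if a != b } yield ((a, b), 1)
}
val counts = pairs.reduceByKey(_ + _)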
Is it possible to generate a JavaPairRDD<String, Integer> from a
JavaPairRDD<String, String>, where I can also use the key values? I have
looked at for instance mapToPair, but this generates a new K/V pair based on
the original value, and does not give me information about the key.
I need this in the
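For what it's worth, the function passed to mapToPair receives the whole
Tuple2, so the key is available there; a sketch of the same idea in Scala
(illustrative names), where the map function pattern-matches on the full pair:

val kv: org.apache.spark.rdd.RDD[(String, String)] =
  sc.parallelize(Seq(("a", "x"), ("b", "yz")))
// Both key and value are in scope when producing the new pair.
val result: org.apache.spark.rdd.RDD[(String, Int)] =
  kv.map { case (k, v) => (k, k.length + v.length) }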
From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
reduce them, and get something like Tuple2<Integer, List<Integer>>.
The only way I found to do this was by treating all variables as String, and
in the reduceByKey do
return a._2 + "," + b._2 // in which both are numeric
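Rather than concatenating strings, groupByKey or aggregateByKey should give
the list directly; a hedged sketch in Scala (JavaPairRDD has the same methods):

val nums: org.apache.spark.rdd.RDD[(Int, Int)] =
  sc.parallelize(Seq((1, 2), (1, 3), (4, 3)))
// groupByKey yields (key, Iterable[Int]) with no string round-trip:
val grouped = nums.groupByKey()
// aggregateByKey builds the List explicitly and controls the merging:
val lists = nums.aggregateByKey(List.empty[Int])((acc, v) => v :: acc, _ ::: _)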
-benchmark/pavlo/text/tiny/crawl)
dataset.saveAsTextFile("/home/tom/hadoop/bigDataBenchmark/test/crawl3.txt")
If you want to do this more often, or use it directly from the cloud instead
of from local (which will be slower), you can add these keys to
./conf/spark-env.sh
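Alternatively, a sketch of setting the same credentials on the Hadoop
configuration from code (values are placeholders):

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
// After this, s3n:// paths resolve without credentials embedded in the URL:
val dataset = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")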
Hi,
I am trying to call some c code, let's say the compiled file is /path/code,
and it has chmod +x. When I call it directly, it works. Now I want to call
it from Spark 1.1. My problem is not building it into Spark, but making sure
Spark can find it.
I have tried:
paragraph about Broadcast Variables, I read "The value is sent to
each node only once, using an efficient, BitTorrent-like communication
mechanism."
- Is the book talking about the proposed BTB from the paper?
- Is this currently the default?
- If not, what is?
Thanks,
Tom
by
hduser. I even performed chmod 777, but Spark keeps on crashing when I run
with spark.eventLog.enabled. It works without it. Any hints?
Thanks,
Tom
We verified it runs on x86, and are now trying to run it on PowerPC. We
currently run into dependency trouble with sbt. I tried installing sbt by
hand and resolving all dependencies by hand, but must have made an error, as
I still get errors.
Original error:
Getting org.scala-sbt sbt 0.13.6 ...
message, I see
while (read < TeraInputFormat.RECORD_LEN) {
- Is it possible that this restricts the branch from running on a cluster?
- Did anybody manage to run this branch on a cluster?
Thanks,
Tom
15/02/25 17:55:42 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
arlab152
source code.
My question:
Could you guys please make the source code of the used TeraSort program,
preferably with settings, available? If not, what are the reasons that this
seems to be withheld?
Thanks for any help,
Tom Hubregtsen
[1]
https://github.com/rxin/spark/commit
Thanks,
Tom
P.S. (I know that the data might not end up being uniformly distributed,
example: 4 elements in part-0 and 2 in part-1)
helped out with this prototype over Twitter’s hack week.) That work
also calls
the Scala API directly, because it was done before we had a Java API; it should
be easier
with the Java one.
Tom
On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:
Hi everyone,
We are using
Do we have a list of things we really want to get in for 1.X? Perhaps move
any jira out to a 1.1 release if we aren't targeting them for 1.0.
It might be nice to send out reminders when these dates are approaching.
Tom
On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta bhas...@gmail.com
should be able to distribute the things needed to
make a recommendation (either the centroids or the attributes matrix), and
just break up the work based on the users you want to generate
recommendations for. I hope this helps.
Tom
On Sat, Apr 12, 2014 at 11:35 AM, Xiaoli Li lixiaolima
Thomson Reuters is looking for a graduate (or possibly advanced
undergraduate) summer intern in Eagan, MN. This is a chance to work on an
innovative project exploring how big data sets can be used by professionals
such as lawyers, scientists and journalists. If you're subscribed to this
mailing
Here are some out-of-the-box ideas: If the elements lie in a fairly small
range and/or you're willing to work with limited precision, you could use
counting sort. Moreover, you could iteratively find the median using
bisection, which would be associative and commutative. It's easy to think
of
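To make the bisection idea concrete, a rough sketch (it assumes the values lie
in a known range and accepts limited precision, as stated above):

def medianByBisection(data: org.apache.spark.rdd.RDD[Double],
                      low: Double, high: Double, iters: Int = 40): Double = {
  val n = data.count()
  var lo = low
  var hi = high
  for (_ <- 1 to iters) {
    val mid = (lo + hi) / 2
    // Counting is associative and commutative, so it parallelizes cleanly.
    val below = data.filter(_ <= mid).count()
    if (2 * below >= n) hi = mid else lo = mid
  }
  (lo + hi) / 2
}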
As to your last line: I've used RDD zipping to avoid GC since MyBaseData is
large and doesn't change. I think this is a very good solution to what is
being asked for.
On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell i...@ianoconnell.com wrote:
A mutable map in an object should do what your
I'm not sure what I said came through. RDD zip is not hacky at all, as it
only depends on a user not changing the partitioning. Basically, you would
keep your losses as an RDD[Double] and zip those with the RDD of examples,
and update the losses. You're doing a copy (and GC) on the RDD of
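A sketch of that pattern (the element type and loss computation are
placeholders; zip requires both RDDs to have identical partitioning):

// examples stays cached and untouched; only the small losses RDD is
// recreated each iteration, keeping GC pressure off the large data.
def updateLosses(examples: org.apache.spark.rdd.RDD[Array[Double]],
                 losses: org.apache.spark.rdd.RDD[Double]) =
  examples.zip(losses).map { case (ex, old) =>
    math.min(old, ex.sum) // placeholder for the real loss computation
  }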
Right---They are zipped at each iteration.
On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen chesterxgc...@yahoo.comwrote:
Tom,
Are you suggesting two RDDs, one with loss and another for the rest
info, using zip to tie them together, but do update on loss RDD (copy) ?
Chester
Sent from
Ian, I tried playing with your suggestion, but I get a task not
serializable error (and some obvious things didn't fix it). Can you get
that working?
On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek minnesota...@gmail.com wrote:
As to your last line: I've used RDD zipping to avoid GC since
to. For instance, will RDDs of the
same size usually get partitioned to the same machines - thus not
triggering any cross machine aligning, etc. We'll explore it, but I would
still very much like to see more direct worker memory management besides
RDDs.
On Mon, Apr 28, 2014 at 10:26 AM, Tom
either go to the RM UI
to link to the spark history UI or go directly to the spark history server ui.
Tom
On Thursday, May 1, 2014 7:09 PM, Jenny Zhao linlin200...@gmail.com wrote:
Hi,
I have installed spark 1.0 from the branch-1.0, build went fine, and I have
tried running the example
of
all node managers. Thus, this is not applicable to hosted clusters).
Tom
On Monday, May 12, 2014 9:38 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi All,
I wanted to launch Spark on Yarn, interactive - yarn client mode.
With default settings of yarn-site.xml and spark-env.sh, I
I've done some comparisons with my own implementation of TRON on Spark.
From a distributed computing perspective, it does 2x more local work per
iteration than LBFGS, so the parallel isoefficiency is improved slightly.
I think the truncated Newton solver holds some potential because there
have
to. But they shouldn't have
overlapped as far as both being up at the same time. Is that the case you are
seeing? Generally you want to look at why the first application attempt fails.
Tom
On Wednesday, May 21, 2014 6:10 PM, Kevin Markey kevin.mar...@oracle.com
wrote:
I tested an application on RC-10
Spark gives you four of the classical collectives: broadcast, reduce,
scatter, and gather. There are also a few additional primitives, mostly
based on a join. Spark is certainly less optimized than MPI for these, but
maybe that isn't such a big deal. Spark has one theoretical disadvantage
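Roughly, in Spark terms (a sketch):

val bc = sc.broadcast(Array(1, 2, 3))            // broadcast
val sum = sc.parallelize(1 to 100).reduce(_ + _) // reduce
val scattered = sc.parallelize(1 to 100, 4)      // "scatter": split into partitions
val gathered = scattered.collect()               // gather: partitions back to the driver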
; permission issues if I try?
Again, I searched the archives but didn't see any of this, but I'm just getting
started so may very well
be missing this somewhere.
Thanks!
Tom
)
.set("spark.driver.memory", "26")
.set("spark.storage.memoryFraction", "1")
.set("spark.core.connection.ack.wait.timeout", "6000")
.set("spark.akka.frameSize", "50")
Thanks,
Tom
On 24 October 2014 12:31, htailor hemant.tai...@live.co.uk wrote:
Hi All,
I am relatively new to spark and currently having
Yes please can you share. I am getting this error after expanding my
application to include a large broadcast variable. Would be good to know if
it can be fixed with configuration.
On 23 October 2014 18:04, Michael Campbell michael.campb...@gmail.com
wrote:
Can you list what your fix was so
I'm trying to set up a PySpark ETL job that takes in JSON log files and
spits out fact table files for upload to Redshift. Is there an efficient
way to send different event types to different outputs without having to
just read the same cached RDD twice? I have my first RDD which is just a
json
Hi,
I've searched but can't seem to find a PySpark example. How do I write
compressed text file output to S3 using PySpark saveAsTextFile?
Thanks,
Tom
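For reference, a sketch of the Scala form, which takes a codec class (the
bucket path is illustrative; PySpark's saveAsTextFile exposes a similar
compressionCodecClass argument):

import org.apache.hadoop.io.compress.GzipCodec
// Writes gzip-compressed part files to the given location.
rdd.saveAsTextFile("s3n://my-bucket/output", classOf[GzipCodec])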
))
Thanks
Best Regards
On Wed, Feb 18, 2015 at 12:21 PM, Tom Walwyn twal...@gmail.com wrote:
Hi All,
I'm a new Spark (and Hadoop) user and I want to find out if the cluster
resources I am using are feasible for my use-case. The following is a
snippet of code that is causing a OOM exception
Hi All,
I'm a new Spark (and Hadoop) user and I want to find out if the cluster
resources I am using are feasible for my use-case. The following is a
snippet of code that is causing a OOM exception in the executor after about
125/1000 tasks during the map stage.
val rdd2 = rdd.join(rdd,
Rashid iras...@cloudera.com wrote:
Hi Tom,
there are a couple of things you can do here to make this more efficient.
first, I think you can replace your self-join with a groupByKey. on your
example data set, this would give you
(1, Iterable(2,3))
(4, Iterable(3))
this reduces the amount
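A sketch of the suggested rewrite, using the toy data above:

val rdd: org.apache.spark.rdd.RDD[(Int, Int)] =
  sc.parallelize(Seq((1, 2), (1, 3), (4, 3)))
// Instead of rdd.join(rdd), collect each key's values once:
val grouped = rdd.groupByKey()
// (1, Iterable(2, 3))
// (4, Iterable(3))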
.pdf.
It is expected to scale sub-linearly; i.e., O(log N), where N is the
number of machines in your cluster.
We evaluated up to 100 machines, and it does follow O(log N) scaling.
--
Mosharaf Chowdhury
http://www.mosharaf.com/
On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen thubregt
Thanks Mosharaf, for the quick response! Can you maybe give me some
pointers to an explanation of this strategy? Or elaborate a bit more on it?
Which parts are involved in which way? Where are the time penalties and how
scalable is this implementation?
Thanks again,
Tom
On 11 March 2015 at 16
you can
use ~ there - IIRC it does not do any kind of variable expansion.
On Mon, Mar 30, 2015 at 3:50 PM, Tom thubregt...@gmail.com wrote:
I have set
spark.eventLog.enabled true
as I try to preserve log files. When I run, I get
Log directory /tmp/spark-events does not exist.
I set
?
(It always helps to show the command line you're actually running, and
if there's an exception, the first few frames of the stack trace.)
On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen thubregt...@gmail.com
wrote:
Updated spark-defaults and spark-env:
Log directory /home/hduser/spark/spark-events
listed in the error message (i, ii), created a text file, closed it and
viewed it, and deleted it (iii). My findings were reconfirmed by my
colleague. Any other ideas?
Thanks,
Tom
On 30 March 2015 at 19:19, Marcelo Vanzin van...@cloudera.com wrote:
So, the error below is still showing the invalid
$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294)
...
sqlCtx.tables()
DataFrame[tableName: string, isTemporary: boolean]
exit()
~ cat /tmp/test10/part-0
{"key":0,"value":0}
{"key":1,"value":1}
{"key":2,"value":2}
{"key":3,"value":3}
{"key":4,"value":4}
{"key":5,"value":5}
Kind Regards,
Tom
On 27 March
to expect that Spark create an external table in this case? What
is the expected behaviour of saveAsTable with the path option?
Setup: running spark locally with spark 1.3.0.
Kind Regards,
Tom
Another follow-up: saveAsTable works as expected when running on a hadoop
cluster with Hive installed. It's just locally that I'm getting this
strange behaviour. Any ideas why this is happening?
Kind Regards.
Tom
On 27 March 2015 at 11:29, Tom Walwyn twal...@gmail.com wrote:
We can set a path
The SparkConf doesn't allow you to set arbitrary variables. You can use
SparkContext's HadoopRDD and create a JobConf (with whatever variables you
want), and then grab them out of the JobConf in your RecordReader.
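A sketch of that approach (the property key is hypothetical):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("my.custom.variable", "some-value") // read back in the RecordReader
FileInputFormat.setInputPaths(jobConf, "/tmp/input")
val rdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])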
On Sun, Feb 22, 2015 at 4:28 PM, hnahak harihar1...@gmail.com wrote:
Hi,
I
Thanks for the responses.
Try removing toDebugString and see what happens.
The toDebugString is performed after [d] (the action), as [e]. By then all
stages are already executed.
]), and with a larger input set can also take
a noticeable time. Does anybody have any idea what is running in this
Job/stage 0?
Thanks,
Tom Hubregtsen
I'm not sure, but I wonder if, because you are using the Spark REPL, it may
not represent what a normal runtime execution would look like, and is
possibly eagerly running a partial DAG once you define an operation that
would cause a shuffle.
What happens if you set up your same set of
Thank you for your response Ewan. I quickly looked yesterday and it was
there, but today at work I tried to open it again to start working on it,
but it appears to be removed. Is this correct?
Thanks,
Tom
On 12 April 2015 at 06:58, Ewan Higgs ewan.hi...@ugent.be wrote:
Hi all.
The code
at 12:05 PM Tom Seddon mr.tom.sed...@gmail.com wrote:
Hi,
I've worked out how to use explode on my input avro dataset with the
following structure
root
|-- pageViewId: string (nullable = false)
|-- components: array (nullable = true)
||-- element: struct (containsNull = false
Hi,
I've worked out how to use explode on my input avro dataset with the
following structure
root
|-- pageViewId: string (nullable = false)
|-- components: array (nullable = true)
||-- element: struct (containsNull = false)
|||-- name: string (nullable = false)
|||--
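Given that schema, a sketch of flattening the array (column names follow the
schema above; exact API details may vary by Spark version):

import org.apache.spark.sql.functions.{col, explode}
// One output row per array element, keeping the page view id alongside it:
val exploded = df.select(col("pageViewId"), explode(col("components")).as("component"))
val names = exploded.select(col("pageViewId"), col("component.name"))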
is only available on pair RDDs, which might have something to do with it..)
I am using the spark master branch. The error:
[error]
/home/th/spark-1.5.0/spark/IBM_ARL_teraSort_v4-01/src/main/scala/IBM_ARL_teraSort.scala:107:
value partitionBy is not a member of org.apache.spark.sql.DataFrame
Thanks,
Tom
I believe that, as you are not persisting anything into the memory space
defined by
spark.storage.memoryFraction,
you also have nothing to clear from this area using unpersist.
FYI: The data will be kept in the OS-buffer/on disk at the point of the
reduce (as this involves a wide dependency -
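In other words, a sketch: unpersist only releases blocks that were explicitly
cached, e.g.

import org.apache.spark.storage.StorageLevel
val cached = rdd.persist(StorageLevel.MEMORY_ONLY) // uses spark.storage.memoryFraction
cached.count()     // materializes the cache
cached.unpersist() // frees those blocks; shuffle files on disk are unaffected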
implemented in
dataFrames (?) and makes me wonder if I should then just use dataFrames in
my regular computation.
Thanks in advance,
Tom
P.S. currently using the master branch from the gitHub
to not use HDFS)
* Bonus question: Should I use a different API to get a better performance?
Thanks for any responses!
Tom Hubregtsen
?
Thanks in advance,
Tom Hubregtsen
metrics will someday be included in the Hadoop FileStatistics
API. In the meantime, it is not currently possible to understand how much of
a Spark task's time is spent reading from disk via HDFS.
That said, this might be posted as a footnote at the event timeline to avoid
confusion :)
Best regards,
Tom
Is there anything other than the spark assembly that needs to be in the
classpath? I verified the assembly was built right and it's in the classpath
(else nothing would work).
Thanks, Tom
On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman
<shiva...@eecs.berkeley.edu>
I have the following script in a file named test.R:
library(SparkR)
sc <- sparkR.init(master="yarn-client")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
showDF(df)
sparkR.stop()
q(save="no")
If I submit this with "sparkR test.R" or "R CMD BATCH test.R" or
I am running the following command on a Hadoop cluster to launch Spark shell
with DRA:
spark-shell --conf spark.dynamicAllocation.enabled=true --conf
spark.shuffle.service.enabled=true --conf
spark.dynamicAllocation.minExecutors=4 --conf
spark.dynamicAllocation.maxExecutors=12 --conf
n$fit$2.apply(Pipeline.scala:138) at
org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
Anyone have this working?
Thanks, Tom
I would like to change the logging level for my application running on a
standalone Spark cluster. Is there an easy way to do that without changing
the log4j.properties on each individual node?
Thanks, Tom
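One option, if you are on Spark 1.4 or later, is to set the level from the
driver (a sketch; executor-side log4j output may still be governed by each
node's configuration):

sc.setLogLevel("WARN") // valid levels include ALL, DEBUG, INFO, WARN, ERROR, OFF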
)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Thanks,
Tom
d.par"
define my table columns
)
Is something like that possible, does that make any sense?
Thanks
Tom
Thanks for your reply Aniket.
Ok I've done this and I'm still confused. Output from running locally
shows:
file:/home/tom/spark-avro/target/scala-2.10/simpleapp.jar
file:/home/tom/spark-1.4.0-bin-hadoop2.4/conf/
file:/home/tom/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar
this setting could be related.
Would greatly appreciate any advice.
Thanks in advance,
Tom
Hey,
I’m wondering if anyone has run into issues with Spark 1.5 and a FileNotFound
exception with shuffle.index files? It’s been cropping up with very large joins
and aggregations, and causing all of our jobs to fail towards the end. The
memory limit for the executors (we’re running on mesos)
Hi Romi,
Thanks! Could you give me an indication of how much increase the partitions by?
We’ll take a stab in the dark; the input data is around 5M records (though each
record is fairly small). We’ve had trouble both with DataFrames and RDDs.
Tom.
> On 18 Nov 2015, at 12:04, Romi Kuntsman
Solved:
Call spark-submit with
--driver-memory 512m --driver-java-options
"-Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2
-Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2"
Thanks to:
https://issues.apache.org/jira/browse/SPARK-14367
Hi,
I am trying to get the same memory behavior in Spark 1.6 as I had in Spark
1.3 with default settings.
I set
--driver-java-options "-Dspark.memory.useLegacyMode=true
-Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6
-Dspark.storage.unrollFraction=0.2"
in Spark 1.6.
But
I would like it also, Mich; please send it through, thanks!
On Thu, 12 May 2016 at 15:14 Alonso Isidoro wrote:
> Me too, send me the guide.
>
> Sent from my iPhone
>
> On 12 May 2016, at 12:11, Ashok Kumar >
re. The return type is an RDD of
> arrays, not of RDDs or of ArrayLists. There may be another catch but
> that is not it.
>
> On Fri, May 13, 2016 at 11:50 AM, Tom Godden <tgod...@vub.ac.be> wrote:
>> I believe it's an illegal cast. This is the line of code:
>>> RDD
I believe it's an illegal cast. This is the line of code:
> RDD<Double[]> windowed =
> RDDFunctions.fromRDD(vals.rdd(), vals.classTag()).sliding(20, 1);
with vals being a JavaRDD<Double>. Explicitly casting
doesn't work either:
> RDD<Double[]> windowed = (RDD<Double[]>)
>
pache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions
>
> An RDD of T produces an RDD of T[].
>
> On Fri, May 13, 2016 at 12:10 PM, Tom Godden <tgod...@vub.ac.be> wrote:
>> I assumed the "fixed size blocks" mentioned in the documentation
>>
the cluster.
I guess there's no solution that fits all, but interested in other people's
experience and whether I've missed anything obvious.
Thanks,
Tom
Thanks Jörn, sounds like there's nothing obvious I'm missing, which is
encouraging.
I've not used Redis, but it does seem that for most of my current and
likely future use-cases it would be the best fit (nice compromise of scale
and easy setup / access).
Thanks,
Tom
On Wed, Sep 14, 2016 at 10
We are happy to announce the availability of Spark 2.2.2!
Apache Spark 2.2.2 is a maintenance release, based on the branch-2.2
maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade
to this stable release. The release notes are available at
I don't know if it all works, but some work was done to make the cluster
manager pluggable; see SPARK-13904.
Tom
On Wednesday, November 6, 2019, 07:22:59 PM CST, Klaus Ma
wrote:
Any suggestions?
- Klaus
On Mon, Nov 4, 2019 at 5:04 PM Klaus Ma wrote:
Hi team,
AFAIK, we built k8s/yarn
" etc.
<https://stackoverflow.com/users/14147688/tom-scott>
On Tue, Sep 8, 2020 at 10:11 PM Tom Scott wrote:
> Hi Guys,
>
> I asked this in stack overflow here:
> https://stackoverflow.com/questions/63535720/why-would-preferredlocations-not-be-enforced-on-an-empty-s
see things like:
scala> someRdd.map(i=>i + ":" +
java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker3
2:worker1
3:worker2
scala> someRdd.map(i=>i + ":" +
java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker2
2:worker3
3:worker1
Am I doing this wrong or is this expected behaviour?
Thanks
Tom
because the processing of the
data in the RDD isn't the bottleneck, the fetching of the crawl data is the
bottleneck, but that happens after the code has been assigned to a node.
Thanks
Tom
-
To unsubscribe e-mail: user-un
> how many partitions does the groupByKey produce? That would limit your
> parallelism no matter what if it's a small number.
>
> On Tue, Jun 8, 2021 at 8:07 PM Tom Barber wrote:
>
> > Hi folks,
> >
> > Hopefully someone with more Spark experience than me can ex
For anyone interested here's the execution logs up until the point where it
actually kicks off the workload in question:
https://gist.github.com/buggtb/a9e0445f24182bc8eedfe26c0f07a473
On 2021/06/09 01:52:39, Tom Barber wrote:
> ExecutorID says driver, and looking at the IP addresses
Interesting Jayesh, thanks, I will test.
All this code is inherited and it runs, but I don't think it's been tested
in a distributed context for about 5 years, but yeah I need to get this
pushed down, so I'm happy to try anything! :)
Tom
On Wed, Jun 9, 2021 at 3:37 AM Lalwani, Jayesh wrote
I've not run it yet, but I've stuck a toSeq on the end, but in reality a
Seq just inherits Iterator, right?
Flatmap does return an RDD[CrawlData] unless my IDE is lying to me.
Tom
On Wed, Jun 9, 2021 at 10:54 AM Tom Barber wrote:
> Interesting Jayesh, thanks, I will test.
>
> All
bfs:/FileStore/bcf/sparkler7.jar","crawl","-id","mytestcrawl11",
"-tn", "5000", "-co",
"{\"plugins.active\":[\"urlfilter-regex\",\"urlfilter-samehost\",\"fetcher-chrome\"],\"plugins\&
> I think we need more info about what else is happening in the code.
>
> On Wed, Jun 9, 2021 at 6:30 AM Tom Barber wrote:
>
>> Yeah so if I update the FairFetcher to return a seq it makes no real
>> difference.
>>
>> Here's an image of what I'm seeing just for r
the tasks. Is that not the
case?
Thanks
Tom
On Wed, Jun 9, 2021 at 3:44 PM Mich Talebzadeh
wrote:
> Hi Tom,
>
> Persist() here simply means persist to memory. That is all. You can check
> UI tab on storage
>
>
> https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persi
teRdd, scoreUpdateFunc)
When it's doing stuff in the SparkUI I can see that it's waiting on the
sc.runJob() line, so that's the execution point.
Tom
On Wed, Jun 9, 2021 at 3:59 PM Sean Owen wrote:
> persist() doesn't even persist by itself - just sets it to be persisted
> when it's execute
se checks out.
I'll poke around in the other hints you suggested later, thanks for the
help.
Tom
On Wed, Jun 9, 2021 at 5:49 PM Chris Martin wrote:
> Hmm then my guesses are (in order of decreasing probability):
>
> * Whatever class makes up fetchedRdd (MemexDeepCrawlDbRDD?) isn't
> compati
ache()
> repRdd.take(1)
> Then map operation on repRdd here.
>
> I’ve done similar map operations in the past and this works.
>
> Thanks.
>
> On Wed, Jun 9, 2021 at 11:17 AM Tom Barber wrote:
>
>> Also just to follow up on that slightly, I did also try off the back
ent] = repRdd.map(d =>
ScoreUpdateSolrTransformer(d))
I did that, but the crawl is executed in that repartition executor (which I
should have pointed out I already know).
Tom
On Wed, Jun 9, 2021 at 4:37 PM Tom Barber wrote:
> Sorry Sam, I missed that earlier, I'll give it a spin.
>
>
.getGroup, r))
>
> how many distinct groups do you ended up with? If there's just one then I
> think you might see the behaviour you observe.
>
> Chris
>
>
> On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote:
>
>> Also just to follow up on that slightly, I di
RDD[SolrInputDocument] =
scoredRdd.repartition(50).map(d => ScoreUpdateSolrTransformer(d))
Where I repartitioned that scoredRdd map out of interest, it then triggers
the FairFetcher function there, instead of in the runJob(), but still on a
single executor
Tom
On Wed, Jun 9, 2021 at 4:11 PM Tom Barber
b) how it divides up partitions to tasks
c) the fact it's a POJO and not a file of stuff.
Or probably some of all 3.
Tom
On Wed, Jun 23, 2021 at 11:44 AM Tom Barber wrote:
> (I should point out that I'm diagnosing this by looking at the active
> tasks https://pasteboard.co/K7VryDJ.png, if
how to split that flatmap
operation up so the RDD processing runs across the nodes, not limited to a
single node?
Thanks for all your help so far,
Tom
On Wed, Jun 9, 2021 at 8:08 PM Tom Barber wrote:
> Ah no sorry, so in the load image, the crawl has just kicked off on the
> driver node which
(I should point out that I'm diagnosing this by looking at the active tasks
https://pasteboard.co/K7VryDJ.png, if I'm reading it incorrectly, let me
know)
On Wed, Jun 23, 2021 at 11:38 AM Tom Barber wrote:
> Uff hello fine people.
>
> So the cause of the above issue was, unsur