Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Interesting Jayesh, thanks, I will test. All this code is inherited and it runs, but I don't think it's been tested in a distributed context for about 5 years. But yeah, I need to get this pushed down, so I'm happy to try anything! :) Tom On Wed, Jun 9, 2021 at 3:37 AM Lalwani, Jayesh wrote: >

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
I've not run it yet, but I've stuck a toSeq on the end, but in reality a Seq just inherits Iterator, right? flatMap does return an RDD[CrawlData] unless my IDE is lying to me. Tom On Wed, Jun 9, 2021 at 10:54 AM Tom Barber wrote: > Interesting Jayesh, thanks, I will test. > > All this code is

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sean Owen
All these configurations don't matter at all if this is executing on the driver. Returning an Iterator in flatMap is fine, though it 'delays' execution until that iterator is evaluated by something, which is normally fine. Does creating this FairFetcher do anything by itself? You're just returning
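A minimal plain-Scala sketch of the laziness Sean describes, with a hypothetical `fetch` standing in for the per-record crawl work (no Spark dependency; the effect is the same for any Iterator): building the iterator, as a flatMap body does, runs nothing until something consumes it.

```scala
object LazyIteratorDemo {
  // Count how many "fetches" have actually run.
  var fetches = 0

  // Hypothetical stand-in for the per-record work a FairFetcher would do.
  def fetch(id: Int): Int = { fetches += 1; id * 10 }

  // Build the iterator but do not consume it -- like returning an
  // Iterator from a flatMap body.
  def build(): Iterator[Int] = Iterator(1, 2, 3).map(fetch)

  def main(args: Array[String]): Unit = {
    val it = build()
    println(s"after build: $fetches fetches")  // nothing has run yet
    val out = it.toList                        // evaluation happens here
    println(s"after toList: $fetches fetches, got $out")
  }
}
```

In Spark the "something" that finally evaluates the iterator is the task that computes the partition, which is why the work can end up wherever that evaluation happens.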

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Yeah so if I update the FairFetcher to return a Seq it makes no real difference. Here's an image of what I'm seeing just for reference: https://pasteboard.co/K5NFrz7.png Because this is Databricks I don't have an actual spark-submit command, but it looks like this: curl -d

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Interesting, Sean, thanks for that insight; I wasn't aware of that fact. I assume the .persist() at the end of that line doesn't do it? I believe, looking at the output in the Spark UI, it gets to

Spark Standalone Authentication and Encryption

2021-06-09 Thread N, Bharath
Hi Team, We are deploying a Spark standalone cluster and using features like RPC authentication with spark.authenticate.secret, and encryption as well. Our security teams have the below queries on this topic and we need your help. 1. How do we make sure spark.authenticate.secret is not visible to
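For reference, a sketch of the relevant spark-defaults.conf settings (property names are the standard ones from Spark's security documentation; the secret value is a placeholder). One common way to keep the secret out of sight is to restrict the file's permissions to the Spark service user rather than passing it on a visible command line.

```properties
# spark-defaults.conf -- sketch; keep this file readable only by the
# Spark service user (e.g. chmod 600) so the shared secret stays private
spark.authenticate              true
spark.authenticate.secret       <shared-secret-placeholder>
spark.network.crypto.enabled    true
spark.io.encryption.enabled     true
```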

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Mich Talebzadeh
Hi Tom, persist() here simply means persist to memory. That is all. You can check the Storage tab in the UI: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence So I gather from your link that the code is stuck in the driver. You stated that you tried repartition() but it did not

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sam
Like I said in my previous email, can you try this and let me know how many tasks you see? val repRdd = scoredRdd.repartition(50).cache() repRdd.take(1) Then do the map operation on repRdd. I've done similar map operations in the past and this works. Thanks. On Wed, Jun 9, 2021 at 11:17 AM Tom

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Chris Martin
One thing I would check is this line: val fetchedRdd = rdd.map(r => (r.getGroup, r)) How many distinct groups do you end up with? If there's just one then I think you might see the behaviour you observe. Chris On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote: > Also just to follow up on
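A toy version of what Chris is pointing at, using a plain-Scala groupBy in place of the RDD one (`Rec` is a made-up record type): if every record carries the same group value, you get exactly one group, and in Spark one group means one task ends up doing all the downstream work regardless of how many partitions you ask for.

```scala
object SingleGroupDemo {
  // Hypothetical record with a group key, standing in for CrawlData.
  case class Rec(group: String, id: Int)

  // Mirror of rdd.map(r => (r.getGroup, r)) followed by a groupBy on the key.
  def groupsFor(recs: Seq[Rec]): Map[String, Seq[Rec]] =
    recs.map(r => (r.group, r)).groupBy(_._1).map { case (k, v) => (k, v.map(_._2)) }

  def main(args: Array[String]): Unit = {
    // First crawl iteration: every record keyed "seed", as in the thread.
    val seeded = (1 to 1000).map(i => Rec("seed", i))
    println(groupsFor(seeded).size) // 1 -- all the work collapses into one group
  }
}
```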

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Thanks Mich, The key on the first iteration is just a string that says "seed", so it is indeed on the first crawl the same across all of the groups. Further iterations would be different, but I'm not there yet. I was under the impression that a repartition would distribute the tasks. Is that not

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Okay so what happens is that the crawler reads a bunch of Solr data, we're not talking GBs, just a list of JSON, and turns that into a bunch of RDDs that end up in that flatMap that I linked to first. The fair fetcher is an interface to a pluggable backend that basically takes some of the fields

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Thanks Chris. All the code I have on both sides is as modern as it allows. Running Spark 3.1.1 and Scala 2.12. I stuck some logging in to check reality: LOG.info("GROUP COUNT: " + fetchedgrp.count()) val cgrp = fetchedgrp.collect() cgrp.foreach(f => { LOG.info("Out1 :" + f._1)

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Sorry Sam, I missed that earlier, I'll give it a spin. To everyone involved, this code is old, and not written by me. If you all go "oooh, you want to distribute the crawls over the cluster, you don't want to do it like that, you should look at XYZ instead" feel free to punt different ways of

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sean Owen
persist() doesn't even persist by itself - just sets it to be persisted when it's executed. key doesn't matter here, nor partitioning, if this code is trying to run things on the driver inadvertently. I don't quite grok what the OSS code you linked to is doing, but it's running some supplied
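A rough plain-Scala analogy for Sean's point about persist(), using a lazy view in place of the marked RDD (the analogy covers only *when* work runs; unlike cache(), a view does not memoize its results): building and "marking" the pipeline computes nothing, and only the action does.

```scala
object PersistLazinessDemo {
  // Count how many times the "expensive" step has actually run.
  var computed = 0
  def expensive(i: Int): Int = { computed += 1; i * i }

  // Building the lazy pipeline -- loosely like calling persist() on an
  // RDD -- runs nothing; the work is only registered, not executed.
  def build(): Iterable[Int] = (1 to 5).view.map(expensive)

  def main(args: Array[String]): Unit = {
    val marked = build()
    println(computed)          // still 0: nothing computed yet
    val forced = marked.toList // the "action": now the work runs
    println(computed)          // 5
    println(forced)            // List(1, 4, 9, 16, 25)
  }
}
```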

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
@sam: def score(fetchedRdd: RDD[CrawlData]): RDD[CrawlData] = { val job = this.job.asInstanceOf[SparklerJob] val scoredRdd = fetchedRdd.map(d => ScoreFunction(job, d)) val m = 50 val repRdd = scoredRdd.repartition(m).cache() repRdd.take(1) val scoreUpdateRdd: RDD[SolrInputDocument]

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Yeah to test that I just set the group key to the ID in the record which is a solr supplied UUID, which means effectively you end up with 4000 groups now. On Wed, Jun 9, 2021 at 5:13 PM Chris Martin wrote: > One thing I would check is this line: > > val fetchedRdd = rdd.map(r => (r.getGroup,
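That matches the hash-partitioning arithmetic. A small sketch of why the switch from one "seed" key to 4000 UUID keys matters (Math.floorMod mirrors the non-negative modulo that Spark's default HashPartitioner applies to key.hashCode; the numbers here are illustrative): one key lands every record in the same partition, while thousands of distinct keys scatter across all of them.

```scala
import java.util.UUID

object KeySpreadDemo {
  // Same arithmetic as a hash partitioner: a key goes to
  // partition floorMod(key.hashCode, numPartitions).
  def partitionOf(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Which partitions does this set of keys actually touch?
  def touched(keys: Seq[String], numPartitions: Int): Set[Int] =
    keys.map(partitionOf(_, numPartitions)).toSet

  def main(args: Array[String]): Unit = {
    // A single repeated key can only ever hit one partition.
    println(touched(Seq.fill(4000)("seed"), 16).size) // 1
    // 4000 random UUID keys spread across (almost surely all) 16 partitions.
    println(touched(Seq.fill(4000)(UUID.randomUUID().toString), 16).size)
  }
}
```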

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Also just to follow up on that slightly, I did also try off the back of another comment: def score(fetchedRdd: RDD[CrawlData]): RDD[CrawlData] = { val job = this.job.asInstanceOf[SparklerJob] val scoredRdd = fetchedRdd.map(d => ScoreFunction(job, d)) val scoreUpdateRdd:

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Chris Martin
Hmm then my guesses are (in order of decreasing probability): * Whatever class makes up fetchedRdd (MemexDeepCrawlDbRDD?) isn't compatible with the latest Spark release. * You've got 16 threads per task on a 16 core machine. Should be fine, but I wonder if it's confusing things as you don't also

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Mich Talebzadeh
Are you running this in Managed Instance Group (MIG)? https://cloud.google.com/compute/docs/instance-groups

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sean Owen
That looks like you did some work on the cluster, and now it's stuck doing something else on the driver - not doing everything on 1 machine. On Wed, Jun 9, 2021 at 12:43 PM Tom Barber wrote: > And also as this morning: https://pasteboard.co/K5Q9aEf.png > > Removing the cpu pins gives me more

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Yeah :) But it's all running through the same node. So I can run multiple tasks of the same type on the same node (the driver), but I can't run multiple tasks on multiple nodes. On Wed, Jun 9, 2021 at 7:57 PM Sean Owen wrote: > Wait. Isn't that what you were trying to parallelize in the first

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
And also as this morning: https://pasteboard.co/K5Q9aEf.png Removing the cpu pins gives me more tasks but as you can see here: https://pasteboard.co/K5Q9GO0.png It just loads up a single server. On Wed, Jun 9, 2021 at 6:32 PM Tom Barber wrote: > Thanks Chris > > All the code I have on

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sean Owen
Where do you see that? ... I see 3 executors busy at first. If that's the crawl, then? On Wed, Jun 9, 2021 at 1:59 PM Tom Barber wrote: > Yeah :) > > But it's all running through the same node. So I can run multiple tasks of > the same type on the same node (the driver), but I can't run multiple

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
No, this is an on-demand Databricks cluster. On Wed, Jun 9, 2021 at 6:54 PM Mich Talebzadeh wrote: > > > Are you running this in Managed Instance Group (MIG)? > > https://cloud.google.com/compute/docs/instance-groups

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Yeah but that something else is the crawl being run, which is triggered from inside the RDDs, because the log output is slowly outputting crawl data. On Wed, 9 Jun 2021, 19:47 Sean Owen, wrote: > That looks like you did some work on the cluster, and now it's stuck doing > something else on the

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
Ah no sorry, so in the load image, the crawl has just kicked off on the driver node, which is why it's flagged red and the load is spiking. https://pasteboard.co/K5QHOJN.png Here's the cluster now it's been running a while. The red node is still (and was always, every time I tested it) the driver node.

Re: NoSuchMethodError: org.apache.spark.network.util.AbstractFileRegion.transferred

2021-06-09 Thread mirkel
I am seeing this issue when running Spark 3.0.2 on YARN. Has a resolution been found for this? (I recently upgraded from using Spark 2.x on YARN) -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/