],
in the Amazon cluster. Is there a way I can download this without being a
user of the Amazon cluster? I tried
bin/hadoop distcp s3n://123:456@big-data-benchmark/pavlo/text/tiny/* ./
but it asks for an AWS Access Key ID and Secret Access Key which I do not
have.
Thanks in advance,
Tom
Hi Burak,
Thank you for your pointer, it really helped. I do have some
follow-up questions though.
After looking at the Big Data Benchmark page
https://amplab.cs.berkeley.edu/benchmark/ (Section Run this benchmark
yourself), I was expecting the following combination of files:
Sets:
the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey
properties (respectively).
I guess the files are publicly available, but only to registered AWS users,
so I caved in and registered for the service. Using the credentials that I
got I was able to download the files using the local spark shell.
Thanks!
Tom
that substr is supported by
HiveQL, but not by Spark SQL, correct?
Thanks!
Tom
files/rdd's would be a
bonus!
Thanks in advance,
Tom
Hi,
I would like to create multiple key-value pairs, where all keys can still be
reduced. For instance, I have the following two lines:
A,B,C
B,D
I would like to return the following pairs for the first line:
A,B
A,C
B,A
B,C
C,A
C,B
And for the second
B,D
D,B
After a reduce by key, I want to end
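A minimal sketch of one way to express this in the spark-shell (all names are
illustrative; the input is assumed to be an RDD of comma-separated strings):

val lines = sc.parallelize(Seq("A,B,C", "B,D"))
// Emit every ordered pair of distinct tokens per line as ((a, b), 1),
// so that reduceByKey can aggregate across all lines afterwards.
val pairs = lines.flatMap { line =>
  val tokens = line.split(",")
  for { a <- tokens; b <- tokens if a != b } yield ((a, b), 1)
}
val counts = pairs.reduceByKey(_ + _)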
Is it possible to generate a JavaPairRDD<String, Integer> from a
JavaPairRDD<String, String>, where I can also use the key values? I have
looked at for instance mapToPair, but this generates a new K/V pair based on
the original value, and does not give me information about the key.
I need this in the
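For what it's worth, the function passed to mapToPair receives the whole
Tuple2, so the key is available there; a sketch of the same idea in Scala
(illustrative names), where the map function pattern-matches on the full pair:

val kv: org.apache.spark.rdd.RDD[(String, String)] =
  sc.parallelize(Seq(("a", "x"), ("b", "yz")))
// Both key and value are in scope when producing the new pair.
val result: org.apache.spark.rdd.RDD[(String, Int)] =
  kv.map { case (k, v) => (k, k.length + v.length) }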
From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
reduce them, and get something like Tuple2<Integer, List<Integer>>.
The only way I found to do this was by treating all variables as String, and
in the reduceByKey do
return a._2 + "," + b._2 // in which both are numeric
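Rather than concatenating strings, groupByKey or aggregateByKey should give
the list directly; a hedged sketch in Scala (JavaPairRDD has the same methods):

val nums: org.apache.spark.rdd.RDD[(Int, Int)] =
  sc.parallelize(Seq((1, 2), (1, 3), (4, 3)))
// groupByKey yields (key, Iterable[Int]) with no string round-trip:
val grouped = nums.groupByKey()
// aggregateByKey builds the List explicitly and controls the merging:
val lists = nums.aggregateByKey(List.empty[Int])((acc, v) => v :: acc, _ ::: _)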
-benchmark/pavlo/text/tiny/crawl)
dataset.saveAsTextFile("/home/tom/hadoop/bigDataBenchmark/test/crawl3.txt")
If you want to do this more often, or use it directly from the cloud instead
of from local (which will be slower), you can add these keys to
./conf/spark-env.sh
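Alternatively, a sketch of setting the same credentials on the Hadoop
configuration from code (values are placeholders):

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
// After this, s3n:// paths resolve without credentials embedded in the URL:
val dataset = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")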
Hi,
I am trying to call some c code, let's say the compiled file is /path/code,
and it has chmod +x. When I call it directly, it works. Now I want to call
it from Spark 1.1. My problem is not building it into Spark, but making sure
Spark can find it.
I have tried:
paragraph about Broadcast Variables, I read "The value is sent to
each node only once, using an efficient, BitTorrent-like communication
mechanism."
- Is the book talking about the proposed BTB from the paper?
- Is this currently the default?
- If not, what is?
Thanks,
Tom
by
hduser. I even performed chmod 777, but Spark keeps on crashing when I run
with spark.eventLog.enabled. It works without it. Any hints?
Thanks,
Tom
We verified it runs on x86, and are now trying to run it on PowerPC. We
currently run into dependency trouble with sbt. I tried installing sbt by
hand and resolving all dependencies by hand, but must have made an error, as
I still get errors.
Original error:
Getting org.scala-sbt sbt 0.13.6 ...
message, I see
while (read < TeraInputFormat.RECORD_LEN) {
- Is it possible that this restricts the branch from running on a cluster?
- Did anybody manage to run this branch on a cluster?
Thanks,
Tom
15/02/25 17:55:42 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
arlab152
source code.
My question:
Could you guys please make the source code of the used TeraSort program,
preferably with settings, available? If not, what are the reasons that this
seems to be withheld?
Thanks for any help,
Tom Hubregtsen
[1]
https://github.com/rxin/spark/commit
Thanks,
Tom
P.S. (I know that the data might not end up being uniformly distributed,
example: 4 elements in part-0 and 2 in part-1)
helped out with this prototype over Twitter’s hack week.) That work
also calls
the Scala API directly, because it was done before we had a Java API; it should
be easier
with the Java one.
Tom
On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:
Hi everyone,
We are using
Do we have a list of things we really want to get in for 1.X? Perhaps move
any jira out to a 1.1 release if we aren't targeting them for 1.0.
It might be nice to send out reminders when these dates are approaching.
Tom
On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta bhas...@gmail.com
should be able to distribute the things needed to
make a recommendation (either the centroids or the attributes matrix), and
just break up the work based on the users you want to generate
recommendations for. I hope this helps.
Tom
On Sat, Apr 12, 2014 at 11:35 AM, Xiaoli Li lixiaolima
Thomson Reuters is looking for a graduate (or possibly advanced
undergraduate) summer intern in Eagan, MN. This is a chance to work on an
innovative project exploring how big data sets can be used by professionals
such as lawyers, scientists and journalists. If you're subscribed to this
mailing
Here are some out-of-the-box ideas: If the elements lie in a fairly small
range and/or you're willing to work with limited precision, you could use
counting sort. Moreover, you could iteratively find the median using
bisection, which would be associative and commutative. It's easy to think
of
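To make the bisection idea concrete, a rough sketch (it assumes the values lie
in a known range and accepts limited precision, as stated above):

def medianByBisection(data: org.apache.spark.rdd.RDD[Double],
                      low: Double, high: Double, iters: Int = 40): Double = {
  val n = data.count()
  var lo = low
  var hi = high
  for (_ <- 1 to iters) {
    val mid = (lo + hi) / 2
    // Counting is associative and commutative, so it parallelizes cleanly.
    val below = data.filter(_ <= mid).count()
    if (2 * below >= n) hi = mid else lo = mid
  }
  (lo + hi) / 2
}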
As to your last line: I've used RDD zipping to avoid GC since MyBaseData is
large and doesn't change. I think this is a very good solution to what is
being asked for.
On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell i...@ianoconnell.com wrote:
A mutable map in an object should do what your
I'm not sure what I said came through. RDD zip is not hacky at all, as it
only depends on a user not changing the partitioning. Basically, you would
keep your losses as an RDD[Double] and zip those with the RDD of examples,
and update the losses. You're doing a copy (and GC) on the RDD of
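A sketch of that pattern (the element type and loss computation are
placeholders; zip requires both RDDs to have identical partitioning):

// examples stays cached and untouched; only the small losses RDD is
// recreated each iteration, keeping GC pressure off the large data.
def updateLosses(examples: org.apache.spark.rdd.RDD[Array[Double]],
                 losses: org.apache.spark.rdd.RDD[Double]) =
  examples.zip(losses).map { case (ex, old) =>
    math.min(old, ex.sum) // placeholder for the real loss computation
  }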
Right---They are zipped at each iteration.
On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen chesterxgc...@yahoo.comwrote:
Tom,
Are you suggesting two RDDs, one with loss and another for the rest
info, using zip to tie them together, but do update on loss RDD (copy) ?
Chester
Sent from
Ian, I tried playing with your suggestion, but I get a task not
serializable error (and some obvious things didn't fix it). Can you get
that working?
On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek minnesota...@gmail.com wrote:
As to your last line: I've used RDD zipping to avoid GC since
to. For instance, will RDDs of the
same size usually get partitioned to the same machines - thus not
triggering any cross machine aligning, etc. We'll explore it, but I would
still very much like to see more direct worker memory management besides
RDDs.
On Mon, Apr 28, 2014 at 10:26 AM, Tom
either go to the RM UI
to link to the spark history UI or go directly to the spark history server ui.
Tom
On Thursday, May 1, 2014 7:09 PM, Jenny Zhao linlin200...@gmail.com wrote:
Hi,
I have installed spark 1.0 from the branch-1.0, build went fine, and I have
tried running the example
of
all node managers. Thus, this is not applicable to hosted clusters).
Tom
On Monday, May 12, 2014 9:38 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi All,
I wanted to launch Spark on Yarn, interactive - yarn client mode.
With default settings of yarn-site.xml and spark-env.sh, I
I've done some comparisons with my own implementation of TRON on Spark.
From a distributed computing perspective, it does 2x more local work per
iteration than LBFGS, so the parallel isoefficiency is improved slightly.
I think the truncated Newton solver holds some potential because there
have
to. But they shouldn't have
overlapped as far as both being up at the same time. Is that the case you are
seeing? Generally you want to look at why the first application attempt fails.
Tom
On Wednesday, May 21, 2014 6:10 PM, Kevin Markey kevin.mar...@oracle.com
wrote:
I tested an application on RC-10
Spark gives you four of the classical collectives: broadcast, reduce,
scatter, and gather. There are also a few additional primitives, mostly
based on a join. Spark is certainly less optimized than MPI for these, but
maybe that isn't such a big deal. Spark has one theoretical disadvantage
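Roughly, in Spark terms (a sketch):

val bc = sc.broadcast(Array(1, 2, 3))            // broadcast
val sum = sc.parallelize(1 to 100).reduce(_ + _) // reduce
val scattered = sc.parallelize(1 to 100, 4)      // "scatter": split into partitions
val gathered = scattered.collect()               // gather: partitions back to the driver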
; permission issues if I try?
Again, I searched the archives but didn't see any of this, but I'm just getting
started so may very well
be missing this somewhere.
Thanks!
Tom
)
.set("spark.driver.memory", "26")
.set("spark.storage.memoryFraction", "1")
.set("spark.core.connection.ack.wait.timeout", "6000")
.set("spark.akka.frameSize", "50")
Thanks,
Tom
On 24 October 2014 12:31, htailor hemant.tai...@live.co.uk wrote:
Hi All,
I am relatively new to spark and currently having
Yes please can you share. I am getting this error after expanding my
application to include a large broadcast variable. Would be good to know if
it can be fixed with configuration.
On 23 October 2014 18:04, Michael Campbell michael.campb...@gmail.com
wrote:
Can you list what your fix was so
I'm trying to set up a PySpark ETL job that takes in JSON log files and
spits out fact table files for upload to Redshift. Is there an efficient
way to send different event types to different outputs without having to
just read the same cached RDD twice? I have my first RDD which is just a
json
Hi,
I've searched but can't seem to find a PySpark example. How do I write
compressed text file output to S3 using PySpark saveAsTextFile?
Thanks,
Tom
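For reference, a sketch of the Scala form, which takes a codec class (the
bucket path is illustrative; PySpark's saveAsTextFile exposes a similar
compressionCodecClass argument):

import org.apache.hadoop.io.compress.GzipCodec
// Writes gzip-compressed part files to the given location.
rdd.saveAsTextFile("s3n://my-bucket/output", classOf[GzipCodec])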
))
Thanks
Best Regards
On Wed, Feb 18, 2015 at 12:21 PM, Tom Walwyn twal...@gmail.com wrote:
Hi All,
I'm a new Spark (and Hadoop) user and I want to find out if the cluster
resources I am using are feasible for my use-case. The following is a
snippet of code that is causing a OOM exception
Hi All,
I'm a new Spark (and Hadoop) user and I want to find out if the cluster
resources I am using are feasible for my use-case. The following is a
snippet of code that is causing a OOM exception in the executor after about
125/1000 tasks during the map stage.
val rdd2 = rdd.join(rdd,
Rashid iras...@cloudera.com wrote:
Hi Tom,
there are a couple of things you can do here to make this more efficient.
first, I think you can replace your self-join with a groupByKey. on your
example data set, this would give you
(1, Iterable(2,3))
(4, Iterable(3))
this reduces the amount
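A sketch of the suggested rewrite, using the toy data above:

val rdd: org.apache.spark.rdd.RDD[(Int, Int)] =
  sc.parallelize(Seq((1, 2), (1, 3), (4, 3)))
// Instead of rdd.join(rdd), collect each key's values once:
val grouped = rdd.groupByKey()
// (1, Iterable(2, 3))
// (4, Iterable(3))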
.pdf.
It is expected to scale sub-linearly; i.e., O(log N), where N is the
number of machines in your cluster.
We evaluated up to 100 machines, and it does follow O(log N) scaling.
--
Mosharaf Chowdhury
http://www.mosharaf.com/
On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen thubregt
Thanks Mosharaf, for the quick response! Can you maybe give me some
pointers to an explanation of this strategy? Or elaborate a bit more on it?
Which parts are involved in which way? Where are the time penalties and how
scalable is this implementation?
Thanks again,
Tom
On 11 March 2015 at 16
you can
use ~ there - IIRC it does not do any kind of variable expansion.
On Mon, Mar 30, 2015 at 3:50 PM, Tom thubregt...@gmail.com wrote:
I have set
spark.eventLog.enabled true
as I try to preserve log files. When I run, I get
Log directory /tmp/spark-events does not exist.
I set
?
(It always helps to show the command line you're actually running, and
if there's an exception, the first few frames of the stack trace.)
On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen thubregt...@gmail.com
wrote:
Updated spark-defaults and spark-env:
Log directory /home/hduser/spark/spark-events
listed in the error message (i, ii), created a text file, closed it and
viewed it, and deleted it (iii). My findings were reconfirmed by my
colleague. Any other ideas?
Thanks,
Tom
On 30 March 2015 at 19:19, Marcelo Vanzin van...@cloudera.com wrote:
So, the error below is still showing the invalid
$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294)
...
sqlCtx.tables()
DataFrame[tableName: string, isTemporary: boolean]
exit()
~ cat /tmp/test10/part-0
{"key":0,"value":0}
{"key":1,"value":1}
{"key":2,"value":2}
{"key":3,"value":3}
{"key":4,"value":4}
{"key":5,"value":5}
Kind Regards,
Tom
On 27 March
to expect that Spark create an external table in this case? What
is the expected behaviour of saveAsTable with the path option?
Setup: running spark locally with spark 1.3.0.
Kind Regards,
Tom
Another follow-up: saveAsTable works as expected when running on a hadoop
cluster with Hive installed. It's just locally that I'm getting this
strange behaviour. Any ideas why this is happening?
Kind Regards.
Tom
On 27 March 2015 at 11:29, Tom Walwyn twal...@gmail.com wrote:
We can set a path
The SparkConf doesn't allow you to set arbitrary variables. You can use
SparkContext's HadoopRDD and create a JobConf (with whatever variables you
want), and then grab them out of the JobConf in your RecordReader.
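A sketch of that approach (the property key is hypothetical):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("my.custom.variable", "some-value") // read back in the RecordReader
FileInputFormat.setInputPaths(jobConf, "/tmp/input")
val rdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])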
On Sun, Feb 22, 2015 at 4:28 PM, hnahak harihar1...@gmail.com wrote:
Hi,
I
Thanks for the responses.
Try removing toDebugString and see what happens.
The toDebugString is performed after [d] (the action), as [e]. By then all
stages are already executed.
]), and with a larger input set can also take
a noticeable time. Does anybody have any idea what is running in this
Job/stage 0?
Thanks,
Tom Hubregtsen
I'm not sure, but I wonder if, because you are using the Spark REPL, it may
not represent what a normal runtime execution would look like, and is
possibly eagerly running a partial DAG once you define an operation that
would cause a shuffle.
What happens if you set up your same set of
Thank you for your response Ewan. I quickly looked yesterday and it was
there, but today at work I tried to open it again to start working on it,
but it appears to be removed. Is this correct?
Thanks,
Tom
On 12 April 2015 at 06:58, Ewan Higgs ewan.hi...@ugent.be wrote:
Hi all.
The code
at 12:05 PM Tom Seddon mr.tom.sed...@gmail.com wrote:
Hi,
I've worked out how to use explode on my input avro dataset with the
following structure
root
|-- pageViewId: string (nullable = false)
|-- components: array (nullable = true)
||-- element: struct (containsNull = false
Hi,
I've worked out how to use explode on my input avro dataset with the
following structure
root
|-- pageViewId: string (nullable = false)
|-- components: array (nullable = true)
||-- element: struct (containsNull = false)
|||-- name: string (nullable = false)
|||--
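Given that schema, a sketch of flattening the array (column names follow the
schema above; exact API details may vary by Spark version):

import org.apache.spark.sql.functions.{col, explode}
// One output row per array element, keeping the page view id alongside it:
val exploded = df.select(col("pageViewId"), explode(col("components")).as("component"))
val names = exploded.select(col("pageViewId"), col("component.name"))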
is only available on pair RDDs, which might have something to do with it..)
I am using the spark master branch. The error:
[error]
/home/th/spark-1.5.0/spark/IBM_ARL_teraSort_v4-01/src/main/scala/IBM_ARL_teraSort.scala:107:
value partitionBy is not a member of org.apache.spark.sql.DataFrame
Thanks,
Tom
I believe that, as you are not persisting anything into the memory space
defined by
spark.storage.memoryFraction,
you also have nothing to clear from this area using unpersist.
FYI: The data will be kept in the OS-buffer/on disk at the point of the
reduce (as this involves a wide dependency -
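In other words, a sketch: unpersist only releases blocks that were explicitly
cached, e.g.

import org.apache.spark.storage.StorageLevel
val cached = rdd.persist(StorageLevel.MEMORY_ONLY) // uses spark.storage.memoryFraction
cached.count()     // materializes the cache
cached.unpersist() // frees those blocks; shuffle files on disk are unaffected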
implemented in
dataFrames (?) and makes me wonder if I should then just use dataFrames in
my regular computation.
Thanks in advance,
Tom
P.S. currently using the master branch from the gitHub
to not use HDFS)
* Bonus question: Should I use a different API to get a better performance?
Thanks for any responses!
Tom Hubregtsen
?
Thanks in advance,
Tom Hubregtsen
metrics will someday be included in the Hadoop FileStatistics
API. In the meantime, it is not currently possible to understand how much of
a Spark task's time is spent reading from disk via HDFS.
That said, this might be posted as a footnote at the event timeline to avoid
confusion :)
Best regards,
Tom
Is there anything other than the spark assembly that needs to be in the
classpath? I verified the assembly was built right and it's in the classpath
(else nothing would work).
Thanks, Tom
On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman
<shiva...@eecs.berkeley.edu>
I have the following script in a file named test.R:
library(SparkR)
sc <- sparkR.init(master="yarn-client")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
showDF(df)
sparkR.stop()
q(save="no")
If I submit this with "sparkR test.R" or "R CMD BATCH test.R" or
I am running the following command on a Hadoop cluster to launch Spark shell
with DRA:
spark-shell --conf spark.dynamicAllocation.enabled=true --conf
spark.shuffle.service.enabled=true --conf
spark.dynamicAllocation.minExecutors=4 --conf
spark.dynamicAllocation.maxExecutors=12 --conf
n$fit$2.apply(Pipeline.scala:138) at
org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
Anyone have this working?
Thanks, Tom
I would like to change the logging level for my application running on a
standalone Spark cluster. Is there an easy way to do that without changing
the log4j.properties on each individual node?
Thanks, Tom
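One option, if you are on Spark 1.4 or later, is to set the level from the
driver (a sketch; executor-side log4j output may still be governed by each
node's configuration):

sc.setLogLevel("WARN") // valid levels include ALL, DEBUG, INFO, WARN, ERROR, OFF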
)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Thanks,
Tom
d.par"
define my table columns
)
Is something like that possible, does that make any sense?
Thanks
Tom
Thanks for your reply Aniket.
Ok I've done this and I'm still confused. Output from running locally
shows:
file:/home/tom/spark-avro/target/scala-2.10/simpleapp.jar
file:/home/tom/spark-1.4.0-bin-hadoop2.4/conf/
file:/home/tom/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar
this setting could be related.
Would greatly appreciate any advice.
Thanks in advance,
Tom
Hey,
I’m wondering if anyone has run into issues with Spark 1.5 and a FileNotFound
exception with shuffle.index files? It’s been cropping up with very large joins
and aggregations, and causing all of our jobs to fail towards the end. The
memory limit for the executors (we’re running on mesos)
Hi Romi,
Thanks! Could you give me an indication of how much increase the partitions by?
We’ll take a stab in the dark; the input data is around 5M records (though each
record is fairly small). We’ve had trouble both with DataFrames and RDDs.
Tom.
> On 18 Nov 2015, at 12:04, Romi Kuntsman
Solved:
Call spark-submit with
--driver-memory 512m --driver-java-options
"-Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2
-Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2"
Thanks to:
https://issues.apache.org/jira/browse/SPARK-14367
Hi,
I am trying to get the same memory behavior in Spark 1.6 as I had in Spark
1.3 with default settings.
I set
--driver-java-options "-Dspark.memory.useLegacyMode=true
-Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6
-Dspark.storage.unrollFraction=0.2"
in Spark 1.6.
But
I would like it also, Mich; please send it through, thanks!
On Thu, 12 May 2016 at 15:14 Alonso Isidoro wrote:
> Me too, send me the guide.
>
> Sent from my iPhone
>
> On 12 May 2016, at 12:11, Ashok Kumar >
re. The return type is an RDD of
> arrays, not of RDDs or of ArrayLists. There may be another catch but
> that is not it.
>
> On Fri, May 13, 2016 at 11:50 AM, Tom Godden <tgod...@vub.ac.be> wrote:
>> I believe it's an illegal cast. This is the line of code:
>>> RDD
I believe it's an illegal cast. This is the line of code:
> RDD<Double[]> windowed =
> RDDFunctions.fromRDD(vals.rdd(), vals.classTag()).sliding(20, 1);
with vals being a JavaRDD<Double>. Explicitly casting
doesn't work either:
> RDD<Double[]> windowed = (RDD<Double[]>)
>
pache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions
>
> An RDD of T produces an RDD of T[].
>
> On Fri, May 13, 2016 at 12:10 PM, Tom Godden <tgod...@vub.ac.be> wrote:
>> I assumed the "fixed size blocks" mentioned in the documentation
>>
the cluster.
I guess there's no solution that fits all, but interested in other people's
experience and whether I've missed anything obvious.
Thanks,
Tom
Thanks Jörn, sounds like there's nothing obvious I'm missing, which is
encouraging.
I've not used Redis, but it does seem that for most of my current and
likely future use-cases it would be the best fit (nice compromise of scale
and easy setup / access).
Thanks,
Tom
On Wed, Sep 14, 2016 at 10
We are happy to announce the availability of Spark 2.2.2!
Apache Spark 2.2.2 is a maintenance release, based on the branch-2.2
maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade
to this stable release. The release notes are available at
I don't know if it all works, but some work was done to make the cluster
manager pluggable; see SPARK-13904.
Tom
On Wednesday, November 6, 2019, 07:22:59 PM CST, Klaus Ma
wrote:
Any suggestions?
- Klaus
On Mon, Nov 4, 2019 at 5:04 PM Klaus Ma wrote:
Hi team,
AFAIK, we built k8s/yarn
" etc.
<https://stackoverflow.com/users/14147688/tom-scott>
On Tue, Sep 8, 2020 at 10:11 PM Tom Scott wrote:
> Hi Guys,
>
> I asked this in stack overflow here:
> https://stackoverflow.com/questions/63535720/why-would-preferredlocations-not-be-enforced-on-an-empty-s
see things like:
scala> someRdd.map(i=>i + ":" +
java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker3
2:worker1
3:worker2
scala> someRdd.map(i=>i + ":" +
java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker2
2:worker3
3:worker1
Am I doing this wrong or is this expected behaviour?
Thanks
Tom
because the processing of the
data in the RDD isn't the bottleneck, the fetching of the crawl data is the
bottleneck, but that happens after the code has been assigned to a node.
Thanks
Tom
-
To unsubscribe e-mail: user-un
> how many partitions does the groupByKey produce? That would limit your
> parallelism no matter what if it's a small number.
>
> On Tue, Jun 8, 2021 at 8:07 PM Tom Barber wrote:
>
> > Hi folks,
> >
> > Hopefully someone with more Spark experience than me can ex
For anyone interested here's the execution logs up until the point where it
actually kicks off the workload in question:
https://gist.github.com/buggtb/a9e0445f24182bc8eedfe26c0f07a473
On 2021/06/09 01:52:39, Tom Barber wrote:
> ExecutorID says driver, and looking at the IP addresses
Interesting Jayesh, thanks, I will test.
All this code is inherited and it runs, but I don't think it's been tested
in a distributed context for about 5 years, but yeah I need to get this
pushed down, so I'm happy to try anything! :)
Tom
On Wed, Jun 9, 2021 at 3:37 AM Lalwani, Jayesh wrote
I've not run it yet, but I've stuck a toSeq on the end, but in reality a
Seq just inherits Iterator, right?
Flatmap does return an RDD[CrawlData] unless my IDE is lying to me.
Tom
On Wed, Jun 9, 2021 at 10:54 AM Tom Barber wrote:
> Interesting Jayesh, thanks, I will test.
>
> All
bfs:/FileStore/bcf/sparkler7.jar","crawl","-id","mytestcrawl11",
"-tn", "5000", "-co",
"{\"plugins.active\":[\"urlfilter-regex\",\"urlfilter-samehost\",\"fetcher-chrome\"],\"plugins\&
> I think we need more info about what else is happening in the code.
>
> On Wed, Jun 9, 2021 at 6:30 AM Tom Barber wrote:
>
>> Yeah so if I update the FairFetcher to return a seq it makes no real
>> difference.
>>
>> Here's an image of what I'm seeing just for r
the tasks. Is that not the
case?
Thanks
Tom
On Wed, Jun 9, 2021 at 3:44 PM Mich Talebzadeh
wrote:
> Hi Tom,
>
> Persist() here simply means persist to memory. That is all. You can check
> UI tab on storage
>
>
> https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persi
teRdd, scoreUpdateFunc)
When it's doing stuff in the SparkUI I can see that it's waiting on the
sc.runJob() line, so that's the execution point.
Tom
On Wed, Jun 9, 2021 at 3:59 PM Sean Owen wrote:
> persist() doesn't even persist by itself - just sets it to be persisted
> when it's execute
se checks out.
I'll poke around in the other hints you suggested later, thanks for the
help.
Tom
On Wed, Jun 9, 2021 at 5:49 PM Chris Martin wrote:
> Hmm then my guesses are (in order of decreasing probability):
>
> * Whatever class makes up fetchedRdd (MemexDeepCrawlDbRDD?) isn't
> compati
ache()
> repRdd.take(1)
> Then map operation on repRdd here.
>
> I’ve done similar map operations in the past and this works.
>
> Thanks.
>
> On Wed, Jun 9, 2021 at 11:17 AM Tom Barber wrote:
>
>> Also just to follow up on that slightly, I did also try off the back
ent] = repRdd.map(d =>
ScoreUpdateSolrTransformer(d))
I did that, but the crawl is executed in that repartition executor (which I
should have pointed out I already know).
Tom
On Wed, Jun 9, 2021 at 4:37 PM Tom Barber wrote:
> Sorry Sam, I missed that earlier, I'll give it a spin.
>
>
.getGroup, r))
>
> how many distinct groups do you ended up with? If there's just one then I
> think you might see the behaviour you observe.
>
> Chris
>
>
> On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote:
>
>> Also just to follow up on that slightly, I di
RDD[SolrInputDocument] =
scoredRdd.repartition(50).map(d => ScoreUpdateSolrTransformer(d))
Where I repartitioned that scoredRdd map out of interest, it then triggers
the FairFetcher function there, instead of in the runJob(), but still on a
single executor
Tom
On Wed, Jun 9, 2021 at 4:11 PM Tom Barber
b) how it divides up partitions to tasks
c) the fact it's a POJO and not a file of stuff.
Or probably some of all 3.
Tom
On Wed, Jun 23, 2021 at 11:44 AM Tom Barber wrote:
> (I should point out that I'm diagnosing this by looking at the active
> tasks https://pasteboard.co/K7VryDJ.png, if
how to split that flatmap
operation up so the RDD processing runs across the nodes, not limited to a
single node?
Thanks for all your help so far,
Tom
On Wed, Jun 9, 2021 at 8:08 PM Tom Barber wrote:
> Ah no sorry, so in the load image, the crawl has just kicked off on the
> driver node which
(I should point out that I'm diagnosing this by looking at the active tasks
https://pasteboard.co/K7VryDJ.png, if I'm reading it incorrectly, let me
know)
On Wed, Jun 23, 2021 at 11:38 AM Tom Barber wrote:
> Uff hello fine people.
>
> So the cause of the above issue was, unsur