This is kind of a how-long-is-a-piece-of-string question. There is no
one tuning for 'terabytes of data'. You can easily run a Spark job
that processes hundreds of terabytes with the defaults and no problems --
something trivial like counting. You can also create Spark jobs that will
never complete -- trying ...
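The trivial case looks something like this in spark-shell (the path is
made up) -- a single pass over the input with no shuffle, which is why the
defaults usually hold up:

    // One narrow pass, no shuffle: this kind of job scales to very large
    // inputs without special tuning. "sc" is the spark-shell context.
    val n = sc.textFile("s3n://some-bucket/huge-input/*").count()
    println("records: " + n)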
Often when this happens to me, it is actually an exception thrown while
parsing a few messages. This is easy to miss, as the error messages aren't
always informative. I would have kept blaming Spark, but in reality it was
missing fields in a CSV file. As has been said, make a file with a few
records and see if your job works.
Did you try it with a smaller subset of the data first?
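For example, something along these lines (the path and column positions
are made up) keeps one bad row from killing a stage, and lets you check the
parsing on a handful of records before running the full job:

    import scala.util.Try

    // Rows with missing fields or a non-numeric value are dropped instead
    // of throwing an exception mid-stage; count them separately if useful.
    def parse(line: String): Option[(String, Long)] = {
      val cols = line.split(",", -1)
      if (cols.length >= 3) Try((cols(0), cols(2).toLong)).toOption else None
    }

    val lines = sc.textFile("s3n://some-bucket/data.csv")
    // Sanity-check the parsing on a small sample first...
    sc.parallelize(lines.take(100)).flatMap(parse).collect().foreach(println)
    // ...then run it on the full input.
    val parsed = lines.flatMap(parse)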
On 23 Jan 2015 05:54, "Kane Kim" wrote:
> I'm trying to process 5TB of data, not doing anything fancy, just
> map/filter and reduceByKey. I spent the whole day today trying to get it
> processed, but never succeeded. I've tried to deploy to ec2 ...
I'm trying to process 5TB of data, not doing anything fancy, just
map/filter and reduceByKey. I spent the whole day today trying to get it
processed, but never succeeded. I've tried to deploy to ec2 with the
script provided with Spark on pretty beefy machines (100 r3.2xlarge
nodes). Really frustrated that ...
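For reference, a minimal sketch of a job of that shape; the tab-delimited
format, the column positions, and the partition count below are hypothetical,
only the map/filter/reduceByKey structure matches the description above:

    val counts = sc.textFile("s3n://some-bucket/input/*")
      .map(_.split("\t", -1))
      .filter(_.length >= 2)            // drop malformed lines
      .map(cols => (cols(0), 1L))       // key on the first column
      .reduceByKey(_ + _, 2000)         // the shuffle; partition count is a guess
    counts.saveAsTextFile("s3n://some-bucket/output")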