-1 from me... same FetchFailed issue as Hector saw. I am running the Netflix dataset and dumping out recommendations for all users. The job shuffles around 100 GB of data on disk to run a reduceByKey per user on utils.BoundedPriorityQueue. The code runs fine with the MovieLens 1M dataset.
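Roughly, the per-user top-N aggregation looks like the sketch below. This is only an illustration, not the actual job code: the method and RDD names are made up, and it assumes a local copy of Spark's private[spark] BoundedPriorityQueue (the utils.BoundedPriorityQueue mentioned above) is on the classpath.

import org.apache.spark.SparkContext._   // PairRDDFunctions implicits (Spark 1.1)
import org.apache.spark.rdd.RDD
import utils.BoundedPriorityQueue        // local copy of Spark's bounded min-heap utility

// `scored` holds (userId, (itemId, score)); keep only the top n items per user.
def topNPerUser(scored: RDD[(Int, (Int, Double))], n: Int): RDD[(Int, Array[(Int, Double)])] = {
  val byScore = Ordering.by[(Int, Double), Double](_._2)
  scored
    .mapValues { rec =>
      val q = new BoundedPriorityQueue[(Int, Double)](n)(byScore)
      q += rec
      q
    }
    // This reduceByKey is the stage that shuffles ~100 GB and hits the FetchFailed errors.
    .reduceByKey((a, b) => a ++= b)
    .mapValues(_.toArray.sortBy(-_._2))
}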
I gave Spark 10 nodes, 8 cores, and 160 GB of memory. The job fails with the
following FetchFailed errors:

14/11/23 11:51:22 WARN TaskSetManager: Lost task 28.0 in stage 188.0 (TID 2818,
tblpmidn08adv-hdp.tdc.vzwcorp.com): FetchFailed(BlockManagerId(1,
tblpmidn03adv-hdp.tdc.vzwcorp.com, 52528, 0), shuffleId=35, mapId=28, reduceId=28)

The behavior is consistent on master as well. I tested it both on YARN and
Standalone. I compiled the spark-1.1 branch (assuming it has all the fixes from
the RC2 tag). I am now compiling the spark-1.0 branch to see if this issue shows
up there as well; if it is related to the hash/sort-based shuffle, it most
likely won't show up on 1.0.

Thanks.
Deb

On Thu, Nov 20, 2014 at 12:16 PM, Hector Yee <hector....@gmail.com> wrote:

> Whoops, I must have used the 1.2 preview and mixed them up.
>
> spark-shell -version shows version 1.2.0
>
> Will update the bug https://issues.apache.org/jira/browse/SPARK-4516 to 1.2.
>
> On Thu, Nov 20, 2014 at 11:59 AM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
> > Ah, I see. But the spark.shuffle.blockTransferService property doesn't
> > exist in 1.1 (AFAIK) -- what exactly are you doing to get this problem?
> >
> > Matei
> >
> > On Nov 20, 2014, at 11:50 AM, Hector Yee <hector....@gmail.com> wrote:
> >
> > This is whatever was in
> > http://people.apache.org/~andrewor14/spark-1.1.1-rc2/
> >
> > On Thu, Nov 20, 2014 at 11:48 AM, Matei Zaharia <matei.zaha...@gmail.com>
> > wrote:
> >
> >> Hector, is this a comment on 1.1.1 or on the 1.2 preview?
> >>
> >> Matei
> >>
> >> > On Nov 20, 2014, at 11:39 AM, Hector Yee <hector....@gmail.com> wrote:
> >> >
> >> > I think it is a race condition caused by netty deactivating a channel
> >> > while it is active.
> >> > Switched to nio and it works fine:
> >> > --conf spark.shuffle.blockTransferService=nio
> >> >
> >> > On Thu, Nov 20, 2014 at 10:44 AM, Hector Yee <hector....@gmail.com>
> >> > wrote:
> >> >
> >> >> I'm still seeing the fetch failed error and updated
> >> >> https://issues.apache.org/jira/browse/SPARK-3633
> >> >>
> >> >> On Thu, Nov 20, 2014 at 10:21 AM, Marcelo Vanzin <van...@cloudera.com>
> >> >> wrote:
> >> >>
> >> >>> +1 (non-binding)
> >> >>>
> >> >>> . ran simple things on spark-shell
> >> >>> . ran jobs in yarn client & cluster modes, and standalone cluster mode
> >> >>>
> >> >>> On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or <and...@databricks.com>
> >> >>> wrote:
> >> >>>> Please vote on releasing the following candidate as Apache Spark
> >> >>>> version 1.1.1.
> >> >>>>
> >> >>>> This release fixes a number of bugs in Spark 1.1.0. Some of the
> >> >>>> notable ones are
> >> >>>> - [SPARK-3426] Sort-based shuffle compression settings are incompatible
> >> >>>> - [SPARK-3948] Stream corruption issues in sort-based shuffle
> >> >>>> - [SPARK-4107] Incorrect handling of Channel.read() led to data truncation
> >> >>>> The full list is at http://s.apache.org/z9h and in the CHANGES.txt
> >> >>>> attached.
> >> >>>>
> >> >>>> Additionally, this candidate fixes two blockers from the previous RC:
> >> >>>> - [SPARK-4434] Cluster mode jar URLs are broken
> >> >>>> - [SPARK-4480][SPARK-4467] Too many open files exception from shuffle spills
> >> >>>>
> >> >>>> The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d):
> >> >>>> http://s.apache.org/p8
> >> >>>>
> >> >>>> The release files, including signatures, digests, etc. can be found at:
> >> >>>> http://people.apache.org/~andrewor14/spark-1.1.1-rc2/
> >> >>>>
> >> >>>> Release artifacts are signed with the following key:
> >> >>>> https://people.apache.org/keys/committer/andrewor14.asc
> >> >>>>
> >> >>>> The staging repository for this release can be found at:
> >> >>>> https://repository.apache.org/content/repositories/orgapachespark-1043/
> >> >>>>
> >> >>>> The documentation corresponding to this release can be found at:
> >> >>>> http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/
> >> >>>>
> >> >>>> Please vote on releasing this package as Apache Spark 1.1.1!
> >> >>>>
> >> >>>> The vote is open until Saturday, November 22, at 23:00 UTC and passes if
> >> >>>> a majority of at least 3 +1 PMC votes are cast.
> >> >>>> [ ] +1 Release this package as Apache Spark 1.1.1
> >> >>>> [ ] -1 Do not release this package because ...
> >> >>>>
> >> >>>> To learn more about Apache Spark, please see
> >> >>>> http://spark.apache.org/
> >> >>>>
> >> >>>> Cheers,
> >> >>>> Andrew
> >> >>>>
> >> >>>> ---------------------------------------------------------------------
> >> >>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >>>> For additional commands, e-mail: dev-h...@spark.apache.org
> >> >>>
> >> >>> --
> >> >>> Marcelo
> >> >>>
> >> >>> ---------------------------------------------------------------------
> >> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >> >>
> >> >> --
> >> >> Yee Yang Li Hector <http://google.com/+HectorYee>
> >> >
> >> > --
> >> > Yee Yang Li Hector <http://google.com/+HectorYee>
> >
> > --
> > Yee Yang Li Hector <http://google.com/+HectorYee>
>
> --
> Yee Yang Li Hector <http://google.com/+HectorYee>
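P.S. For anyone who wants to try the workaround from Hector's message above: the spark.shuffle.blockTransferService property only exists in the 1.2 preview builds, not in 1.1.x (as Matei pointed out). A minimal way to set it, assuming a 1.2 preview install, is:

  ./bin/spark-shell --conf spark.shuffle.blockTransferService=nio

The same property can also go in conf/spark-defaults.conf instead of the command line.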