Thanks for the feedback. For 1, there is an open patch: https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually use MEMORY_AND_DISK storage, so they will spill to disk if you have low memory, but they're faster to access otherwise.
Matei

On Oct 9, 2014, at 12:11 PM, Guillaume Pitel <guillaume.pi...@exensa.com> wrote:

> Hi,
>
> Thanks to your answer, we've found the problem. It was on reverse IP
> resolution on the drivers we used (wrong configuration of the local bind9).
> Apparently, not being able to reverse-resolve the IP address of the nodes was
> the culprit of the 10s delay.
>
> We've hit two other secondary problems with TorrentBroadcast though, in case
> you're interested :
>
> 1 - Broadcasting a variable of about 2GB (1.8GB exactly) triggers a
> "java.lang.OutOfMemoryError: Requested array size exceeds VM limit", which is
> not the case with HttpBroadcast (I guess HttpBroadcast splits the serialized
> variable in small chunks)
> 2 - Memory use of Torrent seems to be higher than Http (i.e. switching from
> Http to Torrent triggers several OOM).
>
> Additionally, a question : while HttpBroadcast stores the broadcast pieces on
> disk (in spark.local.dir/spark-... ), TorrentBroadcast seems not to use disk
> backend storage. Does it mean that HttpBroadcast can handle bigger broadcast
> out of memory ? If so, it's too bad that this design choice wasn't used for
> Torrent.
>
> That being said, hats off to the people in charge of the broadcast unloading
> wrt the lineage, this stuff works great !
>
> Guillaume
>
>
>> Maybe there is a firewall issue that makes it slow for your nodes to connect
>> through the IP addresses they're configured with. I see there's this 10
>> second pause between "Updated info of block broadcast_84_piece1" and
>> "ensureFreeSpace(4194304) called" (where it actually receives the block).
>> HTTP broadcast used only HTTP fetches from the executors to the driver, but
>> TorrentBroadcast has connections between the executors themselves and
>> between executors and the driver over a different port. Where are you
>> running your driver app and nodes?
>>
>> Matei
>>
>> On Oct 7, 2014, at 11:42 AM, Davies Liu <dav...@databricks.com> wrote:
>>
>>> Could you create a JIRA for it? maybe it's a regression after
>>> https://issues.apache.org/jira/browse/SPARK-3119.
>>>
>>> We will appreciate that if you could tell how to reproduce it.
>>>
>>> On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel
>>> <guillaume.pi...@exensa.com> wrote:
>>>> Hi,
>>>>
>>>> I've had no answer to this on u...@spark.apache.org, so I post it on dev
>>>> before filing a JIRA (in case the problem or solution is already
>>>> identified)
>>>>
>>>> We've had some performance issues since switching to 1.1.0, and we finally
>>>> found the origin : TorrentBroadcast seems to be very slow in our setting
>>>> (and it became default with 1.1.0)
>>>>
>>>> The logs of a 4MB variable with TorrentBroadcast : (15s)
>>>>
>>>> 14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84_piece1 stored as bytes in memory (estimated size 171.6 KB, free 7.2 GB)
>>>> 14/10/01 15:47:13 INFO storage.BlockManagerMaster: Updated info of block broadcast_84_piece1
>>>> 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4194304) called with curMem=1401611984, maxMem=9168696115
>>>> 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84_piece0 stored as bytes in memory (estimated size 4.0 MB, free 7.2 GB)
>>>> 14/10/01 15:47:23 INFO storage.BlockManagerMaster: Updated info of block broadcast_84_piece0
>>>> 14/10/01 15:47:23 INFO broadcast.TorrentBroadcast: Reading broadcast variable 84 took 15.202260006 s
>>>> 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4371392) called with curMem=1405806288, maxMem=9168696115
>>>> 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84 stored as values in memory (estimated size 4.2 MB, free 7.2 GB)
>>>>
>>>> (notice that a 10s lag happens after the "Updated info of block
>>>> broadcast_..."
>>>> and before the MemoryStore log
>>>>
>>>> And with HttpBroadcast (0.3s):
>>>>
>>>> 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Started reading broadcast variable 147
>>>> 14/10/01 16:05:58 INFO storage.MemoryStore: ensureFreeSpace(4369376) called with curMem=1373493232, maxMem=9168696115
>>>> 14/10/01 16:05:58 INFO storage.MemoryStore: Block broadcast_147 stored as values in memory (estimated size 4.2 MB, free 7.3 GB)
>>>> 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Reading broadcast variable 147 took 0.320907112 s
>>>> 14/10/01 16:05:58 INFO storage.BlockManager: Found block broadcast_147 locally
>>>>
>>>> Since Torrent is supposed to perform much better than Http, we suspect a
>>>> configuration error from our side, but are unable to pin it down. Does
>>>> someone have any idea of the origin of the problem ?
>>>>
>>>> For now we're sticking with the HttpBroadcast workaround.
>>>>
>>>> Guillaume
>>>> --
>>>> Guillaume PITEL, Président
>>>> +33(0)626 222 431
>>>>
>>>> eXenSa S.A.S.
>>>> 41, rue Périer - 92120 Montrouge - FRANCE
>>>> Tel +33(0)184 163 677 / Fax +33(0)972 283 705
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
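
The root cause reported in this thread was slow reverse resolution of node IP addresses. A quick way to check for that kind of problem, independently of Spark, is to time a reverse lookup for each node address. The following is only a rough sketch using the Python standard library: the function names, the 1-second threshold, and the idea of running it on the driver host are assumptions for illustration, not something taken from the thread or from Spark itself.

```python
import socket
import time


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname_or_None, elapsed_seconds) for a reverse lookup of ip.

    `resolver` defaults to socket.gethostbyaddr; it is a parameter only so
    the check can be exercised without touching the real DNS setup.
    """
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        hostname = resolver(ip)[0]
    except OSError:
        # socket.herror / socket.gaierror are OSError subclasses
        hostname = None
    elapsed = time.monotonic() - start
    return hostname, elapsed


def slow_or_failed(ips, threshold_s=1.0, resolver=socket.gethostbyaddr):
    """Return {ip: (hostname_or_None, elapsed)} for lookups that fail
    or take longer than threshold_s (an arbitrary, hypothetical cutoff)."""
    report = {}
    for ip in ips:
        hostname, elapsed = time_reverse_lookup(ip, resolver)
        if hostname is None or elapsed > threshold_s:
            report[ip] = (hostname, elapsed)
    return report
```

Calling `slow_or_failed([...])` with the cluster's node addresses from the driver machine should return an empty dict when reverse resolution is healthy. Entries with a multi-second elapsed time would be consistent with the kind of pause seen in the TorrentBroadcast logs above, though other causes (firewall, ports) are not ruled out by this check.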