> conn.log because the logger aggregated the records from all the workers. If
> the manager is running out of memory tracking all of the scanners, then the
> single instance of the python script is going to run into the same issue at
> some point.
Oh, I totally forgot to add an important point. The issue isn't memory to
begin with. The issue was that 2 million entries in a table can make the
expire function slow, clogging the event queue, which stalls network_time and
leaves the manager lagging significantly behind. Holding potential scanners on
the workers helps me break the manager's table down to a much smaller size
spread across the workers.

> How are you tracking slow scanners on the workers? If you have 50 workers
> and you are not distributing the data between them, there's only a 1 in 50
> chance that you'll

Exactly, that's what keeps the worker table sizes small until there are enough
connections to flag something as a scanner.

Note: the worker tables only keep IP -> start_time. They report a potential
scanner to the manager, but the manager doesn't need to keep that in a table.
I think I use a combination of a bloom filter and HyperLogLog there to scale
to millions easily.

Note 2: this is to generate the scan_summary and not a scanner, i.e. this
thing:

#fields  ts  scanner  state  detection  start_ts  end_ts  detect_ts  detect_latency  total_conn  total_hosts_scanned  duration  scan_rate  country_code  region  city  distance  event_peer
#types   time  addr  enum  string  time  time  time  interval  count  count  interval  double  string  string  string  double  string
1522090899.384438  122.224.112.162  Scan::DETECT  LandMine  1522090219.816163  1522090744.317532  1522090744.317532  524.501369  19  19  524.501369  27.605335  CN  02  Hangzhou  6243.450744  bro
1522094550.487969  122.224.112.162  Scan::UPDATE  LandMine  1522090219.816163  1522094128.634672  1522090744.317532  524.501369  110  109  3908.818509  35.86072  CN  02  Hangzhou  6243.450744  bro
1522098156.871486  122.224.112.162  Scan::UPDATE  LandMine  1522090219.816163  1522097984.861365  1522090744.317532  524.501369  225  227  7765.045202  34.207248  CN  02  Hangzhou  6243.450744  bro
1522101784.996068  122.224.112.162  Scan::UPDATE  LandMine  1522090219.816163  1522101081.946002  1522090744.317532  524.501369  359  365  10862.129839  29.75926  CN  02  Hangzhou  6243.450744  bro
1522354414.224635  122.224.112.162  Scan::FINISH  LandMine  1522090219.816163  1522103520.055414  1522090744.317532  524.501369  488  507  13300.239251  26.233214  CN  02  Hangzhou  6243.450744  bro

Aashish

On Wed, Apr 18, 2018 at 01:46:08PM +0000, Azoff, Justin S wrote:
> > On Apr 17, 2018, at 4:04 PM, Aashish Sharma <asha...@lbl.gov> wrote:
> >
> > For now, I am resorting to the &expire_func route only. I think by using
> > some more heuristics in the workers' expire functions for more aggregated
> > stats, I am able to shed load on the manager, where the manager doesn't
> > need to track ALL potential scanners.
> >
> > Let's see, I am running it for a few days to see if the new code works
> > without exhausting memory.
> >
> > Yes certainly, the following change did address the manager network_time()
> > stall issues:
> >
> > redef table_expire_interval = 0.1 secs ;
> > redef table_incremental_step = 25 ;
> >
> > Useful observation: if you want to expire a lot of entries from a
> > table/set, expire few but expire often.
> >
> > I still need to determine the limits of both table_incremental_step and
> > table_expire_interval, and whether this works for a million or millions of
> > entries.
>
> That should probably work unless you are adding new table entries at a rate
> higher than 250/second. You may need to tweak that so interval x step is at
> least the rate of new entries.
>
> > Yes, it seems like that. I still don't know at what point. In previous
> > runs it appears after the table had 1.7-2.3 million entries.
> > But then I don't think it's a function of counts, but of how much RAM I've
> > got on the system. Somewhere in that range is when the manager ran out of
> > memory. However (as stated above), I was able to come up with a little
> > heuristic which still allows me to keep track of really slow scanners,
> > while not burdening the manager but rather letting the load be on the
> > workers. The simple observation that really slow scanners aren't going to
> > have a lot of connections allows keeping those in the (few) worker tables.
> > This would potentially be a problem if there really were a LOT of very
> > slow scanners, but, still, those all get divided by the number of workers
> > we run.
>
> How are you tracking slow scanners on the workers? If you have 50 workers
> and you are not distributing the data between them, there's only a 1 in 50
> chance that you'll see the same scanner twice on the same worker, a 1 in
> 2500 chance that you'd see 3 packets in a row on the same worker... and
> 1 in 125,000 for 4 in a row.
>
> >> I'd suggest a different way to think about structuring the problem: you
> >> could Rendezvous Hash the IP addresses across proxies, with each one
> >> managing expiration in just their own table. In that way, the
> >> storage/computation can be uniformly distributed and you should be able
> >> to simply adjust the number of proxies to fit the required scale.
> >
> > I think the above might work reasonably.
>
> It does, it works great.
>
> > So previously I was making the manager keep count of potential scanners,
> > but now I am moving that work to the workers instead. The new model would
> > let us just move all this to the proxy(ies), and the proxies can decide
> > whether to delete or send to the manager for aggregation.
>
> There's no need, you just aggregate it on the proxies, the manager never
> sees anything.
>
> > I suppose, given that proxies don't process packets, it will be cheaper to
> > do all this work there.
> >
> > The only thing that bothers me is that scan detection is a complicated
> > problem only because of the distribution of data in the cluster. It's a
> > much simpler problem if we could just do a
> > tail -f conn.log | ./python-script
>
> That doesn't simplify anything, that just moves the problem. You can only
> tail the single conn.log because the logger aggregated the records from all
> the workers. If the manager is running out of memory tracking all of the
> scanners, then the single instance of the python script is going to run into
> the same issue at some point.
>
> > So yes, we can shed load from manager -> workers -> proxies. I'll try this
> > approach. But I think I am also going to try (with the new broker-enabled
> > cluster) the approach of sending all connections to one proxy/data-store
> > and just doing the aggregation there, and see if that works out (the
> > tail -f conn.log | 'python-script' approach). Admittedly, this needs more
> > thinking to get the right architecture in the new cluster era!
>
> No.. this is just moving the problem again. If your manager is running out
> of memory and you move everything to one proxy, that's just going to have
> the same problem.
>
> The fix is to use the distributing message routing features that I've been
> talking about for a while (and that Jon implemented in the actor-system
> branch!)
>
> The entire change to switch simple-scan from aggregating all scanners on a
> single manager to aggregating scanners across all proxies (which can be on
> multiple machines) is swapping
>
>     event Scan::scan_attempt(scanner, attempt);
>
> with
>
>     Cluster::publish_hrw(Cluster::proxy_pool, scanner, Scan::scan_attempt, scanner, attempt);
>
> (with some @ifdefs to make it work on both versions of bro)
>
> —
> Justin Azoff
>
> _______________________________________________
> bro-dev mailing list
> bro-dev@bro.org
> http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
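
To make the Cluster::publish_hrw swap described in the quoted message a bit
more concrete, here is a minimal sketch of what the two sides could look
like. The Scan module layout, the Attempt record, the report_attempt helper,
and the proxy-side table are assumptions for illustration; the actual
simple-scan change may be structured differently.

    # Sketch only: workers HRW-hash each scanner address across the proxy
    # pool, so a given scanner is always handled by the same proxy.
    module Scan;

    export {
        type Attempt: record {
            victim: addr;
            scanned_port: port;
        };

        global scan_attempt: event(scanner: addr, attempt: Attempt);
    }

    @ifdef ( Cluster::publish_hrw )
    # Bro 2.6+ broker-based cluster: route by scanner address across proxies.
    function report_attempt(scanner: addr, attempt: Attempt)
        {
        Cluster::publish_hrw(Cluster::proxy_pool, scanner, Scan::scan_attempt, scanner, attempt);
        }
    @else
    # Older cluster framework: raise the event locally and rely on the usual
    # worker-to-manager event forwarding instead.
    function report_attempt(scanner: addr, attempt: Attempt)
        {
        event Scan::scan_attempt(scanner, attempt);
        }
    @endif

    # With publish_hrw, only the proxy chosen by the hash receives a given
    # scanner, so each proxy's table holds roughly 1/Nth of the state for N
    # proxies.
    global attempt_count: table[addr] of count &default=0 &create_expire=1 hr;

    event Scan::scan_attempt(scanner: addr, attempt: Attempt)
        {
        ++attempt_count[scanner];
        }

Because the scanner address is the hash key, every worker sends attempts for
the same scanner to the same proxy, so the per-scanner counters never need to
be merged across nodes.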
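
For the bloom filter / HyperLogLog bookkeeping mentioned near the top of this
message, here is a rough per-worker sketch. The module name, the event, the
thresholds, and the choice of connection_attempt as the trigger are all
illustrative assumptions rather than the actual heuristic from the thread.

    # Sketch only: per-worker bookkeeping for slow scanners.  A plain table
    # keeps just IP -> first-seen time, a bloom filter remembers which
    # addresses were already reported, and a HyperLogLog counter estimates
    # how many distinct destinations each candidate has touched.
    module ScanSummary;

    export {
        # Estimated distinct hosts before a candidate is reported (illustrative).
        const report_threshold = 15.0 &redef;

        global potential_scanner: event(scanner: addr, start_ts: time);
    }

    # IP -> time the candidate was first seen on this worker.  A real version
    # would likely attach an &expire_func here, as discussed in the thread.
    global candidate_start: table[addr] of time &create_expire=6 hr;

    # Estimated distinct destinations per candidate.
    global hosts_scanned: table[addr] of opaque of cardinality &create_expire=6 hr;

    # Candidates this worker has already reported; a bloom filter keeps this
    # cheap across millions of addresses, at the cost of rare false positives.
    global reported: opaque of bloomfilter;

    event bro_init()
        {
        reported = bloomfilter_basic_init(0.001, 10000000);
        }

    event connection_attempt(c: connection)
        {
        local scanner = c$id$orig_h;

        if ( bloomfilter_lookup(reported, scanner) > 0 )
            return;    # already reported once from this worker

        if ( scanner !in candidate_start )
            {
            candidate_start[scanner] = network_time();
            hosts_scanned[scanner] = hll_cardinality_init(0.01, 0.95);
            }

        hll_cardinality_add(hosts_scanned[scanner], c$id$resp_h);

        if ( hll_cardinality_estimate(hosts_scanned[scanner]) >= report_threshold )
            {
            # Report once; the manager (or a proxy) does not need to keep a
            # per-scanner table entry for this.
            event ScanSummary::potential_scanner(scanner, candidate_start[scanner]);
            bloomfilter_add(reported, scanner);
            delete candidate_start[scanner];
            delete hosts_scanned[scanner];
            }
        }

Both probabilistic structures use only a few bytes per element, which is what
makes tracking millions of candidate addresses per worker plausible without
the table-expiration pressure described above.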