I have written a ton of HTML scrapers in my life. The best technique I have found
is to strip out all the HTML tags first and run regexes on the text only.

So you get HTML like this:
    
    
    <div class="_7sjd"><i class="_7sjb img sp_BqX-Srs8YcK 
sx_b851c2"><u>messages</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div 
class="_6a">Messages</div></div><div class="_1d6k _7sjd"><span class="_51lp 
_5ugf _5ugh" id="u_fetchstream_1_6">1</span></div></a></div><div 
class="_1d6i"><a 
href="/ads/growth/aymt/homepage/panel/redirect/?data=%7B%22selected_object_id%22%3A1550952275230845%2C%22is_collapsed%22%3A0%2C%22object_ids%22%3A%5B1550952275230845%5D%2C%22section%22%3A%22Header+Section%22%2C%22clicked_target%22%3A%22Selected+Page+Notifications+Count%22%2C%22event%22%3A%22click%22%7D&redirect_url=%2Fistrolid%2Fmanager%2F"><div
 class="_7sjd"><i class="_7sjb img sp_MHk6M-gfm5c 
sx_d38e37"><u>globe-americas</u></i></div><div class="_1d6j _7sjd fsm fwn 
fcg"><div class="_6a">Notifications</div></div><div class="_1d6k _7sjd"><span 
class="_51lp _51lr _5ugf _5ugh" 
id="u_fetchstream_1_7">20+</span></div></a></div>
    

It's almost always best to just strip out all the HTML tags, which leaves this:
    
    
       messages     Messages    1         globe-americas     Notifications    
20+
    
    

Then it becomes trivial to regex for the message count and the notification count.

See example code: 
    
    
    import re
    
    var s = """<div class="_7sjd"><i class="_7sjb img sp_BqX-Srs8YcK 
sx_b851c2"><u>messages</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div 
class="_6a">Messages</div></div><div class="_1d6k _7sjd"><span class="_51lp 
_5ugf _5ugh" id="u_fetchstream_1_6">1</span></div></a></div><div 
class="_1d6i"><a 
href="/ads/growth/aymt/homepage/panel/redirect/?data=%7B%22selected_object_id%22%3A1550952275230845%2C%22is_collapsed%22%3A0%2C%22object_ids%22%3A%5B1550952275230845%5D%2C%22section%22%3A%22Header+Section%22%2C%22clicked_target%22%3A%22Selected+Page+Notifications+Count%22%2C%22event%22%3A%22click%22%7D&redirect_url=%2Fistrolid%2Fmanager%2F"><div
 class="_7sjd"><i class="_7sjb img sp_MHk6M-gfm5c 
sx_d38e37"><u>globe-americas</u></i></div><div class="_1d6j _7sjd fsm fwn 
fcg"><div class="_6a">Notifications</div></div><div class="_1d6k _7sjd"><span 
class="_51lp _51lr _5ugf _5ugh" 
id="u_fetchstream_1_7">20+</span></div></a></div>"""
    
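    # Replace every tag with a space so neighbouring text stays separated.
    s = s.replace(re"<[^>]*>", " ")
    
    echo s
    
    # \d* matches digits only, so "20+" is reported as "20".
    echo findAll(s, re"Messages\s*\d*")
    echo findAll(s, re"Notifications\s*\d*")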
    
    
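
If you want the counts as actual integers rather than matched substrings, something along these lines should work. It is only a sketch: `countAfter` is a made-up helper name, and it assumes the number always follows the label directly (whitespace aside) in the stripped text.

    import re, strutils
    
    # Hypothetical helper, not part of any library: returns the number that
    # follows `label` in tag-stripped text, or -1 if none is found.
    proc countAfter(text, label: string): int =
      var m: array[1, string]   # receives the single capture group
      if text.find(re(escapeRe(label) & r"\s*(\d+)"), m) >= 0:
        result = parseInt(m[0])
      else:
        result = -1
    
    let text = "   messages     Messages    1         globe-americas     Notifications    20+"
    
    echo countAfter(text, "Messages")        # 1
    echo countAfter(text, "Notifications")   # 20 (the trailing "+" is dropped)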
