On 05/23/2015 04:15 PM, savitha devi wrote:
What I want exactly: the JavaScript is in the HTML code. I am trying to write
a regular expression to find the email address embedded within the
JavaScript.
On Sat, May 23, 2015 at 2:31 PM, Chris Angelico <[email protected]> wrote:
On Sat, May 23, 2015 at 4:46 PM, savitha devi <[email protected]> wrote:
I am developing a web scraper using HTMLParser. I need to extract a
text/email address from JavaScript within the HTML code. I am a beginner
in Python coding and totally lost here. I need some help with this. The
JavaScript code is as below:
<script type='text/javascript'>
//<!--
document.getElementById('cloak48218').innerHTML = '';
var prefix = 'ma' + 'il' + 'to';
var path = 'hr' + 'ef' + '=';
var addy48218 = 'info' + '@';
addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
//-->
This is deliberately being done to prevent scripted usage. What
exactly are you needing to do this for?
You're basically going to have to execute the entire block of
JavaScript code, and then decode the entities to get to what you want.
Doing it manually is pretty easy; doing it automatically will
virtually require a language interpreter.
ChrisA
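
For the scraping side of the question, getting hold of the raw script text is the easy part. A minimal sketch (Python 3, standard library only) might look like the following. The class name ScriptCollector and the small sample page are made up for illustration; only HTMLParser and its handle_* hooks come from the standard library.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data


# Hypothetical page, cut down from the sample in this thread.
page = """<html><body>
<span id='cloak48218'></span>
<script type='text/javascript'>
var addy48218 = 'info' + '@';
addy48218 = addy48218 + 'tsv-neuried' + '.' + 'de';
</script>
</body></html>"""

collector = ScriptCollector()
collector.feed(page)
collector.close()
print(collector.scripts[0])
```

This only hands back the JavaScript source as a string; the address is still split into pieces at this point, which is why some form of evaluation is needed afterwards.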
--
https://mail.python.org/mailman/listinfo/python-list
This is just about nuts and bolts, not about the ethics of presumed
intentions.
Hope it helps one way or another.
Frederic
-------------------------------------------------------------------------------
sample = '''//<!--
document.getElementById('cloak48218').innerHTML = '';
var prefix = 'ma' + 'il' + 'to';
var path = 'hr' + 'ef' + '=';
var addy48218 = 'info' + '@';
addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + ''>' + addy48218+'<\/a>';
//-->'''
>>> import SE # Download from PyPi at https://pypi.python.org/pypi/SE
>>> def make_se_translator ():
        # Make SE substitutions
        subs_list = []
        # Make &# code substitutions
        for i in range (256):
            subs_list.append ('&#%d;=%c' % (i, chr(i)))
        # Delete JavaScript stuff
        subs_list.append (' "document.getElementById(\'cloak48218\').=" ')
        subs_list.append (' "var =" "\n=" //<!--= //-->= ')
        # JavaScript syntax? Tweaks needed to get the sample working
        subs_list.append (' "+ \'\'\'=" \'\'>\'=\'>\' <\/=</ ')
        # Add more as needed, trial-and-error style
        # subs_list.append ( . . . format: ' old=new "delete this=" '
        # Make text
        subs = '\n'.join (subs_list)
        # Make SE translator
        translator = SE.SE (subs)
        # return translator, subs   # print subs, if you want to see what they look like
        return translator
>>> translator = make_se_translator ()
>>> translation = translator (sample)
>>> print translation # See
innerHTML = ''; prefix = 'ma' + 'il' + 'to'; path = 'hr' + 'ef' + '=';
addy48218 = 'info' + '@'; addy48218 = addy48218 + 'tsv-neuried' + '.'
+'de'; innerHTML += '<a ' + path +prefix + ':' + addy48218 + '>' +
addy48218+'</a>';
>>> exec (translation.lstrip ())
>>> print innerHTML
<a href=mailto:info@tsv-neuried.de>info@tsv-neuried.de</a>
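
If installing SE or calling exec on scraped text is not an option, the same idea can be sketched with the standard re module alone. This is Python 3, unlike the session above, and it is not a JavaScript interpreter: it only understands the pattern seen in the sample, namely assignments whose right-hand side is a chain of single-quoted literals and variable names joined by "+". The names ASSIGN, TERM and resolve are made up for this illustration.

```python
import re

# One statement: optional "var", a name, "=" or "+=", then an expression up to
# the terminating ";". Quoted literals are consumed whole, so a ";" inside a
# literal does not end the statement early.
ASSIGN = re.compile(
    r"(?:var\s+)?([A-Za-z_]\w*)\s*(\+?=)\s*((?:'(?:\\.|[^'\\])*'|[^;'])*);"
)
# One term of a concatenation: a single-quoted literal or a bare identifier.
TERM = re.compile(r"'((?:\\.|[^'\\])*)'|([A-Za-z_]\w*)")


def resolve(js):
    """Return a dict of variable values after replaying the assignments in js."""
    env = {}
    for name, op, expr in ASSIGN.findall(js):
        value = ""
        for literal, ident in TERM.findall(expr):
            if ident:
                # Unknown names are treated as empty strings.
                value += env.get(ident, "")
            else:
                # Drop the backslash from escapes such as \' and \/.
                value += re.sub(r"\\(.)", r"\1", literal)
        if op == "+=":
            env[name] = env.get(name, "") + value
        else:
            env[name] = value
    return env


SAMPLE = r"""//<!--
document.getElementById('cloak48218').innerHTML = '';
var prefix = 'ma' + 'il' + 'to';
var path = 'hr' + 'ef' + '=';
var addy48218 = 'info' + '@';
addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
//-->"""

env = resolve(SAMPLE)
print(env["addy48218"])
print(env["innerHTML"])
```

Because nothing is executed, hostile script text cannot run code this way, but anything beyond plain string concatenation (function calls, character codes, arithmetic) would need a real JavaScript engine, as pointed out earlier in the thread.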