2011/3/8 Cross <x...@x.tv>:
> On 03/08/2011 06:09 PM, Heather Brown wrote:
>>
>> The keywords are an attribute in a tag called <meta>, in the section
>> called <head>. Are you having trouble parsing the xhtml to that point?
>>
>> Be more specific in your question, and somebody is likely to chime in.
>> Although I'm not the one, if it's a question of parsing the xhtml.
>>
>> DaveA
>
> I know meta tags contain keywords but they are not always reliable. I can
> parse xhtml to obtain keywords from meta tags; but how do I verify them. To
> obtain reliable keywords, I have to parse the plain text obtained from the
> URL.
>
> Cross
Hi,

if you need to extract meaningful keywords in terms of data mining using
natural language processing, it might become quite a complex task, depending
on the requirements; the NLTK toolkit may help with some approaches
[ http://www.nltk.org/ ].

One possibility would be to filter out the frequent but less meaningful words
("stopwords") and extract the most frequent words from the remainder, e.g.
(with some simplifications/hacks in the interactive mode):

>>> import re, urllib2, nltk
>>> page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read().decode("utf-8")
>>> page_plain = nltk.clean_html(page_src).lower()
>>> txt_filtered = nltk.Text(word for word in re.findall(r"(?u)\w+", page_plain) if word not in set(nltk.corpus.stopwords.words("english")))
>>> frequency_dist = nltk.FreqDist(txt_filtered)
>>> [(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]
[(u'python', 39), (u'abc', 11), (u'code', 10), (u'c', 7), (u'language', 7),
 (u'programming', 7), (u'unix', 7), (u'foreword', 5), (u'new', 5),
 (u'would', 5), (u'1st', 4), (u'book', 4), (u'ed', 4), (u'features', 4),
 (u'many', 4), (u'one', 4), (u'programmer', 4), (u'time', 4), (u'use', 4),
 (u'community', 3), (u'documentation', 3), (u'early', 3), (u'enough', 3),
 (u'even', 3), (u'first', 3), (u'help', 3), (u'indentation', 3),
 (u'instance', 3), (u'less', 3), (u'like', 3), (u'makes', 3),
 (u'personal', 3), (u'programmers', 3), (u'readability', 3),
 (u'readable', 3), (u'write', 3)]
>>>

Another possibility would be to extract parts of speech (e.g. nouns,
adjectives, verbs) using e.g. nltk.pos_tag(input_txt) etc.; for more
convoluted html code e.g. BeautifulSoup might be used, and there are likely
many other options.

hth,
  vbr
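
P.S. For the pos_tag / BeautifulSoup route, a rough, untested sketch could
look like the following (the url is just the same essay page as above, the
filter on "NN*" noun tags is only one plausible choice, the import is the
BeautifulSoup 3.x style, and pos_tag / word_tokenize need the relevant NLTK
data installed via nltk.download()):

import urllib2
import nltk
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3.x style import

url = "http://www.python.org/doc/essays/foreword/"
page_src = urllib2.urlopen(url).read().decode("utf-8")

# let BeautifulSoup strip the markup instead of nltk.clean_html()
soup = BeautifulSoup(page_src)
page_plain = " ".join(soup.findAll(text=True)).lower()

# tokenize, tag the parts of speech and keep only the nouns ("NN*" tags)
tokens = nltk.word_tokenize(page_plain)
tagged = nltk.pos_tag(tokens)
nouns = [word for (word, tag) in tagged if tag.startswith("NN")]

# frequency distribution of the noun candidates, as in the example above
frequency_dist = nltk.FreqDist(nouns)
print frequency_dist.items()[:20]   # the twenty most frequent noun keywords

(FreqDist.items() is sorted by decreasing frequency, so the slice gives the
top candidates; tagging a whole page is of course noticeably slower than the
plain stopword filtering above.)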