On Monday, 25 November 2013, 12:33:25, abhishek wrote:
> a simple way of cleaning the html tags is using NLTK's "clean_html"

Hey,

thanks, I didn't know about that. Just for information: this can now be done with BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text

It will solve part of the problem. Right now I'm still a layer down (encapsulation etc.) at the RFC 2822 stuff. Python's email package does not work as automatically as I wish.

Regards,
Florian

> On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler <jaquesgrob...@gmail.com> wrote:
> > Hey Florian,
> >
> > So you need some lexical analyzer to remove all the HTML tags etc. before
> > you start your classification?
> > I'm not sure about any ready-to-use packages for this (I'm sure they're
> > out there), but I've played around with Python's `re` module at some point
> > and now found this, which might be useful to you if you want to make your
> > own lexical analyzer for your purposes.
> >
> > http://www.gooli.org/blog/a-simple-lexer-in-python/
> >
> > Anyway, I hope this is helpful in some way.
> >
> > Good luck and kind regards,
> > Jaq
> >
> >
> > 2013/11/24 Florian Lindner <mailingli...@xgm.de>
> >
> >> Hello,
> >>
> >> I want to use scikit-learn for mail classification (no spam detection). I
> >> haven't really worked with machine learning software (besides end-user
> >> spam filters).
> >>
> >> What I have done so far:
> >>
> >> vectorizer = TfidfVectorizer(input='filename',
> >>                              preprocessor=mail_preprocessor,
> >>                              decode_error="ignore")
> >> X = vectorizer.fit_transform(["testmail2"])
> >>
> >> testmail2 is a raw email message (taken from a server's maildir). I set
> >> decode_error because of UTF-8 decoding issues that I decided to ignore
> >> for the time being.
> >>
> >> This works perfectly for the scikit-learn part. But one challenge (for
> >> me) seems to be preparing the mail for feature extraction.
> >>
> >> My idea would be to take the text/plain parts of the mails, maybe
> >> additionally the From header.
> >>
> >> import email
> >>
> >> def mail_preprocessor(raw_message):
> >>     # Parse the raw RFC 2822 text into a message object.
> >>     msg = email.message_from_string(raw_message)
> >>     msg_body = ""
> >>
> >>     for part in msg.walk():
> >>         if part.get_content_type() == "text/plain":
> >>             # get_payload(decode=True) returns bytes; decode with the part's charset.
> >>             payload = part.get_payload(decode=True)
> >>             charset = part.get_content_charset() or "utf-8"
> >>             msg_body += payload.decode(charset, errors="ignore")
> >>
> >>     # Normalize case and flatten newlines and tabs to spaces.
> >>     msg_body = msg_body.lower()
> >>     msg_body = msg_body.replace("\n", " ")
> >>     msg_body = msg_body.replace("\t", " ")
> >>     return msg_body
> >>
> >> I know that this may be slightly off-topic, and I apologize if it's too
> >> off-topic.
> >>
> >> Is there already some code in the wild that prepares mail messages for
> >> feature extraction? The topic seems to be much more involved than I had
> >> suspected, regarding issues like HTML, MIME encodings, multipart stuff, ...
> >>
> >> Thanks!
> >>
> >> Florian
> >>
> >>
> >> ------------------------------------------------------------------------------
> >> Shape the Mobile Experience: Free Subscription
> >> Software experts and developers: Be at the forefront of tech innovation.
> >> Intel(R) Software Adrenaline delivers strategic insight and game-changing
> >> conversations that shape the rapidly evolving mobile landscape. Sign up
> >> now.
> >> http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> Scikit-learn-general@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
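
For reference, here is a minimal sketch of the preprocessing discussed in this thread: take the From header plus the text/plain parts of a raw message, and fall back to tag-stripped HTML when there is no plain part. It is only one possible approach, not a ready-made package. The names `mail_to_text`, `html_to_text` and `_TextExtractor` are made up for illustration, and the standard library's `html.parser` is used as a stand-in for BeautifulSoup's `get_text()` so that the example has no third-party dependencies. It assumes Python 3 and the `email` package's `policy.default`.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags from an HTML string (stand-in for BeautifulSoup's get_text)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def mail_to_text(raw_bytes):
    """Return the From header plus the message text as one lowercase string."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    plain, html = [], []
    for part in msg.walk():
        # Skip multipart containers and attachments; only leaf text parts matter.
        if part.is_multipart() or part.get_content_disposition() == "attachment":
            continue
        ctype = part.get_content_type()
        if ctype == "text/plain":
            # get_content() decodes the transfer encoding and the charset.
            plain.append(part.get_content())
        elif ctype == "text/html":
            html.append(part.get_content())
    # Prefer plain text; use stripped HTML only when no plain part exists.
    body = plain or [html_to_text(h) for h in html]
    pieces = [str(msg.get("From", ""))] + body
    # Lowercase and collapse all whitespace runs (newlines, tabs) to one space.
    return " ".join(" ".join(pieces).lower().split())
```

Since the function expects bytes, one way to combine it with the vectorizer would be to read each maildir file in binary mode, call `mail_to_text` on it, and pass the resulting strings to `TfidfVectorizer` with its default `input='content'`, rather than using `input='filename'` together with a `preprocessor`.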