@Florian - Abhishek's suggestion is the way to go. Simple and works well.
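
In case NLTK is not installed, here is a minimal sketch of the same idea (dropping the tags, keeping the text) using only the standard library's `html.parser`. The names `_TextExtractor` and `strip_html` are made up for illustration and are not part of NLTK or scikit-learn. As far as I know, NLTK 3.x no longer implements `clean_html` and points to BeautifulSoup's `get_text()` instead, so treat this as one possible fallback rather than the canonical way.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script/style element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse all whitespace runs.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Chunks are joined with a space so that words from adjacent elements do not run together, which is usually what you want before tokenizing. The downside is that a word split by inline markup (say, `<b>H</b>ello`) ends up as two tokens.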
2013/11/25 abhishek <abhish...@gmail.com>

> a simple way of cleaning the html tags is using NLTK's "clean_html"
>
>
> On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler
> <jaquesgrob...@gmail.com> wrote:
>
>> Hey Florian,
>>
>> So you need some lexical analyzer to remove all the HTML tags etc. before
>> you start your classification?
>> I'm not sure about any ready-to-use packages for this (I'm sure they're
>> out there), but I've played around with Python's `re` module at some
>> point and now found this, which might be useful to you if you want to
>> make your own lexical analyzer for your purposes.
>>
>> http://www.gooli.org/blog/a-simple-lexer-in-python/
>>
>> Anyway I hope this is helpful in some way.
>>
>> Good luck and kind Regards,
>> Jaq
>>
>>
>> 2013/11/24 Florian Lindner <mailingli...@xgm.de>
>>
>>> Hello,
>>>
>>> I want to use scikit-learn for mail classification (no spam detection).
>>> I haven't really worked with machine learning software (besides end-user
>>> spam filters).
>>>
>>> What I have done so far:
>>>
>>> vectorizer = TfidfVectorizer(input='filename',
>>>                              preprocessor=mail_preprocessor,
>>>                              decode_error="ignore")
>>> X = vectorizer.fit_transform(["testmail2"])
>>>
>>> testmail2 is a raw email message (taken from a server's maildir). I set
>>> decode_error because of UTF-8 decoding issues that I decided to ignore
>>> for the time being.
>>>
>>> This works perfectly for the scikit-learn part. But one challenge (for
>>> me) seems to be preparing the mail for feature extraction.
>>>
>>> My idea would be to take the text/plain parts of the mails, maybe
>>> additionally the From header.
>>>
>>> def mail_preprocessor(str):
>>>     msg = email.message_from_string(str)
>>>     msg_body = ""
>>>     for part in msg.walk():
>>>         if part.get_content_type() == "text/plain":
>>>             msg_body += part.get_payload(decode=True)
>>>     msg_body = msg_body.lower()
>>>     msg_body = msg_body.replace("\n", " ")
>>>     msg_body = msg_body.replace("\t", " ")
>>>     return msg_body
>>>
>>> I know that this may be slightly off-topic and I apologize if it's too
>>> off-topic.
>>>
>>> Is there already some code in the wild that prepares mail messages for
>>> feature extraction? The topic seems to be much more involved than I had
>>> suspected, regarding issues like HTML, MIME encodings, multipart
>>> stuff, ...
>>>
>>> Thanks!
>>>
>>> Florian
>>>
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>
>
> --
> Regards
>
> Abhishek Thakur
>
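
For anyone trying the quoted `mail_preprocessor` on Python 3: `get_payload(decode=True)` returns bytes there, so appending it to a `str` fails. Below is a hedged sketch of the same idea (collect the text/plain parts, lower-case them, flatten whitespace) using the standard library's `email` package with `policy.default`, which decodes each part according to its declared charset. The `include_from` parameter is my own addition to cover the "maybe additionally the From header" idea; it is not something the original post or scikit-learn defines.

```python
import email
from email import policy


def mail_preprocessor(raw, include_from=False):
    """Return the lower-cased text/plain content of a raw email message.

    `include_from` is a hypothetical switch that prepends the From header.
    """
    msg = email.message_from_string(raw, policy=policy.default)
    parts = []
    if include_from and msg["From"] is not None:
        parts.append(str(msg["From"]))
    for part in msg.walk():
        # Skip multipart containers, HTML alternatives, and attachments.
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        # With policy.default, get_content() returns decoded text (str).
        parts.append(part.get_content())
    body = " ".join(parts).lower()
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    return " ".join(body.split())
```

The function keeps the one-string-in, one-string-out shape of the original, so it should still be usable as the `preprocessor=` argument of `TfidfVectorizer`. A part that declares a charset Python does not know may still raise, so real maildir data will probably need some extra error handling around `get_content()`.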
_______________________________________________ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general