On Monday, 25 November 2013, 12:33:25, abhishek wrote:
> a simple way of cleaning the html tags is using NLTK's "clean_html"

Hey,

thx, didn't know about that.

Just for information: this is now done by BeautifulSoup:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
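
For anyone who would rather not pull in an extra dependency, tag stripping can be approximated with the standard library's html.parser. This is only a rough sketch and not what BeautifulSoup's get_text does internally; the names TextExtractor and strip_html are made up for illustration, and skipping script/style content is my own assumption about what is wanted for classification.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Note that joining the chunks with spaces can split a word that is interrupted by inline markup, which is probably acceptable for bag-of-words features but not for display.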

It will solve part of the problem. Right now I'm still one layer further
down, dealing with the RFC 2822 side (MIME encapsulation etc.).

Python's email package does not work as automatically as I wish.
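
For reference, later Python 3 releases ship a newer API in the email package that handles more of this automatically. A minimal sketch, assuming a Python version where email.policy.default and EmailMessage.get_body are available; extract_plain_text is a hypothetical helper name, and returning an empty string when there is no text/plain body is my own choice.

```python
import email
from email import policy


def extract_plain_text(raw):
    """Return the decoded text/plain body of a raw message, or "" if none."""
    # policy.default makes the parser return EmailMessage objects,
    # which provide get_body() and get_content().
    msg = email.message_from_string(raw, policy=policy.default)

    # Ask only for a text/plain body; attachments are not considered.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""

    # get_content() undoes the transfer encoding and decodes the charset.
    return part.get_content()
```

Messages with an unknown or wrongly declared charset can still raise here, so real maildir data will likely need some error handling around the call.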

Regards,
Florian
 
> On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler <jaquesgrob...@gmail.com> wrote:
> > Hey Florian,
> > 
> > So you need some lexical analyzer to remove all the HTML tags etc before
> > you start your classification?
> > I'm not sure about any ready-to-use packages for this (I'm sure they're
> > out there),
> > but I've played around with Python's `re` module at some point and
> > found this, which might be useful to you if you want to write your own
> > lexical analyzer for your purposes.
> > 
> > http://www.gooli.org/blog/a-simple-lexer-in-python/
> > 
> > Anyway I hope this is helpful in some way.
> > 
> > Good luck and kind Regards,
> > Jaq
> > 
> > 
> > 2013/11/24 Florian Lindner <mailingli...@xgm.de>
> > 
> >> Hello,
> >> 
> >> I want to use scikit-learn for mail classification (not spam detection). I
> >> haven't really worked with machine learning software (besides end-user
> >> spamfilters).
> >> 
> >> What I have done so far:
> >> 
> >> from sklearn.feature_extraction.text import TfidfVectorizer
> >> 
> >> vectorizer = TfidfVectorizer(input='filename',
> >>                              preprocessor=mail_preprocessor,
> >>                              decode_error="ignore")
> >> X = vectorizer.fit_transform(["testmail2"])
> >> 
> >> testmail2 is a raw email message (taken from a server's maildir). I set
> >> decode_error because of UTF-8 decoding issues that I decided to ignore
> >> for the time being.
> >> 
> >> This works perfectly for the scikit-learn part. But one challenge (for me)
> >> seems to be preparing the mail for feature extraction.
> >> 
> >> My idea would be to take the text/plain parts of the mails, maybe
> >> additionally the From header.
> >> 
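> >> import email
> >> 
> >> def mail_preprocessor(raw):
> >>     # Parse the raw RFC 2822 message text.
> >>     msg = email.message_from_string(raw)
> >>     msg_body = ""
> >>     
> >>     for part in msg.walk():
> >>         if part.get_content_type() == "text/plain":
> >>             # decode=True undoes the transfer encoding and yields bytes,
> >>             # so decode them with the part's charset (utf-8 as fallback).
> >>             payload = part.get_payload(decode=True) or b""
> >>             charset = part.get_content_charset() or "utf-8"
> >>             msg_body += payload.decode(charset, errors="ignore")
> >>     
> >>     msg_body = msg_body.lower()
> >>     msg_body = msg_body.replace("\n", " ")
> >>     msg_body = msg_body.replace("\t", " ")
> >>     return msg_body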
> >> 
> >> I know that this may be slightly off-topic, and I apologize if it's too
> >> off-topic.
> >> 
> >> Is there already some code in the wild that prepares mail messages for
> >> feature extraction? The topic seems to be much more involved than I had
> >> expected, regarding issues like HTML, MIME encodings, multipart stuff, ...
> >> 
> >> Thanks!
> >> 
> >> Florian
> >> 
> >> 
> >> ------------------------------------------------------------------------------
> >> Shape the Mobile Experience: Free Subscription
> >> Software experts and developers: Be at the forefront of tech innovation.
> >> Intel(R) Software Adrenaline delivers strategic insight and game-changing
> >> conversations that shape the rapidly evolving mobile landscape. Sign up now.
> >> http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> Scikit-learn-general@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> > 


