Hey Florian,
So you need some kind of lexical analyzer to strip all the HTML tags etc.
before you start your classification?
I don't know of any ready-to-use packages for this (though I'm sure they're
out there),
but I've played around with Python's `re` module at some point and just found
this, which might be useful if you want to write your own lexical
analyzer for your purposes:
http://www.gooli.org/blog/a-simple-lexer-in-python/
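To give an idea of what I mean, here is a rough sketch of a regex-based
tokenizer, assuming Python 3 and nothing beyond the `re` module. It is not
the code from that blog post; TOKEN_SPEC and tokenize are names I made up.
It only keeps word tokens and throws away anything that looks like a tag or
a character reference. Regexes are not a real HTML parser, so text such as
"a < b and c > d" would be misread as a tag.

```python
import re

# Token classes, tried in this order at every position (illustrative names).
TOKEN_SPEC = [
    ("TAG", r"<[^>]+>"),      # anything that looks like an HTML tag
    ("ENTITY", r"&#?\w+;"),   # character references such as &amp; or &#160;
    ("WORD", r"[^\W_]+"),     # runs of letters and digits
    ("SKIP", r"."),           # any other single character
]
TOKEN_RE = re.compile(
    "|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC), re.DOTALL
)


def tokenize(text):
    """Yield lower-cased word tokens, dropping tags, entities and punctuation."""
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup == "WORD":
            yield match.group().lower()


print(list(tokenize("<b>Hello</b> &amp; welcome<br/>to the list")))
```

The token list can be joined back into a string, or the function could be
handed to the vectorizer as a custom tokenizer.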
Anyway, I hope this is helpful in some way.
Good luck and kind regards,
Jaq
2013/11/24 Florian Lindner <mailingli...@xgm.de>
> Hello,
>
> I want to use scikit-learn for mail classification (not spam detection). I
> haven't really worked with machine learning software (besides end-user
> spam filters).
>
> What I have done so far:
>
> from sklearn.feature_extraction.text import TfidfVectorizer
>
> vectorizer = TfidfVectorizer(input='filename',
>                              preprocessor=mail_preprocessor,
>                              decode_error="ignore")
> X = vectorizer.fit_transform(["testmail2"])
>
> testmail2 is a raw email message (taken from a server's maildir). I set
> decode_error because of UTF-8 decoding issues that I decided to ignore for
> the time being.
>
> This works perfectly for the scikit-learn part. But one challenge (for me)
> seems to be to prepare the mail for feature extraction.
>
> My idea would be to take the text/plain parts of the mails, and maybe
> additionally the From header.
>
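> import email
>
> def mail_preprocessor(raw):
>     # Parse the raw message and collect every text/plain part.
>     msg = email.message_from_string(raw)
>     msg_body = ""
>     for part in msg.walk():
>         if part.get_content_type() == "text/plain":
>             # get_payload(decode=True) returns bytes; decode them to text.
>             charset = part.get_content_charset() or "utf-8"
>             msg_body += part.get_payload(decode=True).decode(charset, "replace")
>     # Normalise case and flatten line breaks and tabs.
>     msg_body = msg_body.lower()
>     msg_body = msg_body.replace("\n", " ")
>     msg_body = msg_body.replace("\t", " ")
>     return msg_body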
>
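On the preprocessing idea quoted above (text/plain parts plus the From
header), here is a minimal sketch, assuming Python 3 and only the standard
library `email` package. mail_to_text is a name I made up, and the fallback
to UTF-8 for unknown charsets is my own assumption, not something the
library prescribes. It works on the raw bytes of a message so that the
transfer encoding and charset of each part are handled per part.

```python
import email


def mail_to_text(raw_bytes):
    """Return the From header plus all text/plain parts as one flat string."""
    msg = email.message_from_bytes(raw_bytes)
    chunks = [str(msg.get("From", ""))]
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # Undo base64 / quoted-printable; the result is bytes.
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to UTF-8 (an assumption).
            chunks.append(payload.decode("utf-8", errors="replace"))
    # Lower-case and collapse all runs of whitespace to single spaces.
    return " ".join(" ".join(chunks).lower().split())
```

You would read each maildir file in binary mode and pass the bytes in; the
resulting strings can then go to the vectorizer as plain content instead of
filenames.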
> I know that this may be slightly off-topic and I apologize if it's too
> off-topic.
>
> Is there already some code in the wild that prepares mail messages for
> feature extraction? The topic seems to be much more involved than I had
> expected, regarding issues like HTML, MIME encodings, multipart stuff, ...
>
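On the HTML issue: for mails that only carry a text/html part, the standard
library's html.parser can pull out the visible text without hand-written
regexes. A rough sketch, assuming Python 3; TextExtractor and html_to_text
are names I made up, and skipping only <script> and <style> is my own
simplification.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring markup and <script>/<style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.chunks).split())
```

The output could then be lower-cased and fed to the vectorizer in the same
way as the text/plain parts.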
> Thanks!
>
> Florian
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>