A simple way of cleaning the HTML tags is to use NLTK's "clean_html".
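
Note that NLTK 3.x no longer implements clean_html (calling it raises NotImplementedError and points to BeautifulSoup's get_text() instead). As a dependency-free illustration of the same idea, here is a minimal sketch built on the standard library's html.parser. This is not NLTK's implementation; the names TextExtractor and strip_tags are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_tags(html_text):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    # Chunks are joined with a space, so a word split by inline markup
    # (e.g. "wor<b>ld</b>") would come out as two tokens.
    return " ".join(" ".join(parser.chunks).split())


print(strip_tags("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

The result of strip_tags could then be passed on to the vectorizer in place of the raw HTML part.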
On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler <jaquesgrob...@gmail.com> wrote:
> Hey Florian,
>
> So you need some lexical analyzer to remove all the HTML tags etc before
> you start your classification?
> I'm not sure about any ready-to-use packages for this (I'm sure they're
> out there),
> but I've played around with Python's `re` module at some point and now
> found this which might be useful to you, if you want to make your own
> lexical analyzer for your purposes.
>
> http://www.gooli.org/blog/a-simple-lexer-in-python/
>
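
To make the `re`-based lexer idea concrete, here is a minimal sketch. It is not the code from the linked post; the token names, the patterns, and the helpers tokenize and words_only are assumptions chosen for illustration, and a regex like this is only a rough approximation of real HTML parsing.

```python
import re

# Ordered token specification: the first alternative that matches wins.
TOKEN_SPEC = [
    ("TAG", r"<[^>]+>"),      # anything that looks like an HTML tag
    ("WORD", r"[^\W\d_]+"),   # a run of letters (Unicode-aware)
    ("NUMBER", r"\d+"),       # a run of digits
    ("SKIP", r"\s+"),         # whitespace, dropped by tokenize()
    ("OTHER", r"."),          # any other single character
]
TOKEN_RE = re.compile(
    "|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC), re.DOTALL
)


def tokenize(text):
    """Yield (kind, value) pairs for every non-whitespace token."""
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind != "SKIP":
            yield kind, match.group()


def words_only(text):
    """Keep only the lower-cased WORD tokens, joined by single spaces."""
    return " ".join(
        value.lower() for kind, value in tokenize(text) if kind == "WORD"
    )


print(words_only("<p>Hello World</p>"))
```

Filtering on the token kind is what lets the tags and punctuation drop out before the text reaches the vectorizer.
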
> Anyway I hope this is helpful in some way.
>
> Good luck and kind Regards,
> Jaq
>
>
> 2013/11/24 Florian Lindner <mailingli...@xgm.de>
>
>> Hello,
>>
>> I want to use scikit-learn for mail classification (not spam detection). I
>> haven't really worked with machine learning software (besides end-user
>> spam filters).
>>
>> What I have done so far:
>>
>> from sklearn.feature_extraction.text import TfidfVectorizer
>>
>> vectorizer = TfidfVectorizer(input='filename',
>>                              preprocessor=mail_preprocessor,
>>                              decode_error="ignore")
>> X = vectorizer.fit_transform(["testmail2"])
>>
>> testmail2 is a raw email message (taken from a server's maildir). I set
>> decode_error because of UTF-8 decoding issues that I decided to ignore for
>> the time being.
>>
>> This works perfectly for the scikit-learn part. But one challenge (for me)
>> seems to be to prepare the mail for feature extraction.
>>
>> My idea would be to take the text/plain parts of the mails, maybe
>> additionally the From header.
>>
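>> import email
>>
>> def mail_preprocessor(raw_mail):
>>     # Parse the raw message and concatenate all text/plain parts.
>>     msg = email.message_from_string(raw_mail)
>>     msg_body = ""
>>     for part in msg.walk():
>>         if part.get_content_type() == "text/plain":
>>             # On Python 3, get_payload(decode=True) returns bytes.
>>             payload = part.get_payload(decode=True) or b""
>>             charset = part.get_content_charset() or "utf-8"
>>             msg_body += payload.decode(charset, errors="ignore")
>>     # Normalize case and flatten newlines and tabs to spaces.
>>     msg_body = msg_body.lower()
>>     msg_body = msg_body.replace("\n", " ")
>>     msg_body = msg_body.replace("\t", " ")
>>     return msg_body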
>>
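
The idea described above (the text/plain parts plus the From header) can also be sketched with the standard library's newer email API (Python 3.6+, `email.policy.default`), which decodes transfer encodings and charsets itself. This is a sketch under assumptions, not tested against real maildir data; the function name mail_to_text and its include_from flag are hypothetical.

```python
import email
from email import policy


def mail_to_text(raw_mail, include_from=True):
    """Return the From header and all text/plain parts as one lower-cased line."""
    # policy.default yields EmailMessage objects with get_content().
    msg = email.message_from_string(raw_mail, policy=policy.default)
    pieces = []
    if include_from and msg["From"]:
        pieces.append(str(msg["From"]))
    for part in msg.walk():
        # Skip attachments so that attached .txt files are not mixed in.
        if part.get_content_type() == "text/plain" and not part.is_attachment():
            pieces.append(part.get_content())  # already decoded to str
    # Lower-case and collapse newlines, tabs and repeated spaces.
    return " ".join(" ".join(pieces).lower().split())


raw = (
    "From: Alice <alice@example.com>\n"
    "Subject: Hi\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Hello\tWorld\nBye\n"
)
print(mail_to_text(raw))
```

HTML-only mails produce no text/plain part here, so they would still need a tag-stripping step before feature extraction.
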
>> I know that this may be slightly off-topic, and I apologize if it's too
>> off-topic.
>>
>> Is there already some code in the wild that prepares mail messages for
>> feature extraction? The topic seems to be much more complicated than I had
>> suspected, regarding issues like HTML, MIME encodings, multipart stuff, ...
>>
>> Thanks!
>>
>> Florian
>>
>>
>> ------------------------------------------------------------------------------
>> Shape the Mobile Experience: Free Subscription
>> Software experts and developers: Be at the forefront of tech innovation.
>> Intel(R) Software Adrenaline delivers strategic insight and game-changing
>> conversations that shape the rapidly evolving mobile landscape. Sign up
>> now.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
--
Regards
Abhishek Thakur