@Florian - Abhishek's suggestion is the way to go. It's simple and works well.
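
In case NLTK is not an option: as far as I know, newer NLTK releases no longer ship a working clean_html and point to BeautifulSoup's get_text() instead. Below is a rough sketch of the same idea using only the standard library's html.parser. It is not NLTK's implementation; the names strip_html and _TextExtractor are made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script or style element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces, then collapse all whitespace runs.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_html("<p>Hello <b>world</b></p>"))
```

Because the text pieces are joined with spaces, a word split by inline markup (for example "wor<b>ld</b>") comes out as two tokens. For real-world mail a dedicated HTML library is probably the more robust choice.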


2013/11/25 abhishek <abhish...@gmail.com>

> A simple way of cleaning the HTML tags is to use NLTK's "clean_html".
>
>
> On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler
> <jaquesgrob...@gmail.com> wrote:
>
>> Hey Florian,
>>
>> So you need some lexical analyzer to remove all the HTML tags etc. before
>> you start your classification?
>> I'm not sure about any ready-to-use packages for this (I'm sure they're
>> out there), but I've played around with Python's `re` module at some
>> point and just found this, which might be useful to you if you want to
>> write your own lexical analyzer for your purposes.
>>
>> http://www.gooli.org/blog/a-simple-lexer-in-python/
>>
>> Anyway I hope this is helpful in some way.
>>
>> Good luck and kind regards,
>> Jaq
>>
>>
>> 2013/11/24 Florian Lindner <mailingli...@xgm.de>
>>
>>> Hello,
>>>
>>> I want to use scikit-learn for mail classification (not spam detection). I
>>> haven't really worked with machine learning software (besides end-user
>>> spam filters).
>>>
>>> What I have done so far:
>>>
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>>
>>> vectorizer = TfidfVectorizer(input='filename',
>>>                              preprocessor=mail_preprocessor,
>>>                              decode_error="ignore")
>>> X = vectorizer.fit_transform(["testmail2"])
>>>
>>> testmail2 is a raw email message (taken from a server's maildir). I set
>>> decode_error because of UTF-8 decoding issues that I decided to ignore
>>> for the time being.
>>>
>>> This works perfectly for the scikit-learn part. But one challenge (for me)
>>> seems to be preparing the mail for feature extraction.
>>>
>>> My idea would be to take the text/plain parts of the mails, and maybe
>>> additionally the From header.
>>>
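>>> import email
>>>
>>> def mail_preprocessor(raw):
>>>     # Concatenate every text/plain part of the message into one string.
>>>     msg = email.message_from_string(raw)
>>>     msg_body = ""
>>>     for part in msg.walk():
>>>         if part.get_content_type() == "text/plain":
>>>             # decode=True undoes base64/quoted-printable and returns bytes,
>>>             # so decode them with the part's declared charset.
>>>             charset = part.get_content_charset() or "utf-8"
>>>             payload = part.get_payload(decode=True)
>>>             msg_body += payload.decode(charset, errors="replace")
>>>     msg_body = msg_body.lower()
>>>     msg_body = msg_body.replace("\n", " ")
>>>     msg_body = msg_body.replace("\t", " ")
>>>     return msg_body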
>>>
>>> I know that this may be slightly off-topic, and I apologize if it's too
>>> off-topic.
>>>
>>> Is there already some code in the wild that prepares mail messages for
>>> feature extraction? The topic seems to be much more involved than I had
>>> expected, with issues like HTML, MIME encodings, multipart messages, ...
>>>
>>> Thanks!
>>>
>>> Florian
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Shape the Mobile Experience: Free Subscription
>>> Software experts and developers: Be at the forefront of tech innovation.
>>> Intel(R) Software Adrenaline delivers strategic insight and game-changing
>>> conversations that shape the rapidly evolving mobile landscape. Sign up
>>> now.
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>
>
>
> --
> Regards
>
> Abhishek Thakur
>

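On the original question about MIME encodings and multipart messages: a minimal sketch of how the standard library's email package could do most of the heavy lifting, assuming Python 3 and mails read as raw bytes. The function name mail_to_text is hypothetical, and skipping attachments is my own choice rather than anything proposed in the thread.

```python
import email
from email import policy


def mail_to_text(raw_bytes):
    """Return the From header plus all text/plain parts as one lowercase line."""
    # policy.default gives the modern API (get_content, decoded headers).
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    chunks = [str(msg.get("From", ""))]
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue  # assumption: attached .txt files are not body text
        try:
            # get_content() undoes base64 / quoted-printable and decodes
            # with the charset declared in the part's headers.
            chunks.append(part.get_content())
        except LookupError:
            # Unknown charset label: fall back to a lossy UTF-8 decode.
            payload = part.get_payload(decode=True)
            chunks.append(payload.decode("utf-8", "replace"))
    # Lowercase and collapse newlines, tabs and repeated spaces.
    return " ".join(" ".join(chunks).lower().split())


raw = (
    b"From: Alice <alice@example.com>\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"SGVsbG8KV29ybGQ=\r\n"
)
print(mail_to_text(raw))
```

One way to combine this with the vectorizer would be to read each maildir file in binary mode, run it through such a function, and pass the resulting strings to TfidfVectorizer with its default input='content', instead of using input='filename' with a preprocessor. HTML-only mails would still need a separate tag-stripping step.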