"MRAB" <goo...@mrett.plus.com> wrote: BJörn Lindqvist wrote: 8< --------------------------- >> For example, to find the email you can use a simple regexp. If there >> is a match you can be certain that that is the authors email. But what >> algorithms can you use to figure out the other information? >> >Tricky! :-) > >How would _you_ recognise them? Have a look at the documents and see if >you can see a pattern. For example, names and address often consist of a >sequence of words in title case, eg "Björn Lindqvist", which might help >you narrow down the list of possibilities. What do telephone numbers >look like, etc?
It may help you to think about the problem if you imagine yourself
having to extract the information from documents written in a language
that you do not understand.

An address may be identified by a number in a line (street address or
PO box) that is followed some lines later by another number (zip code).
But this hardly qualifies as an "algorithm".

A "mailto:" and/or a set of angle brackets is a strong clue for the
email address too...

I don't have a clue about the name, though: plain title case might work
for "John Brown" but it fails with "Koos van der Merwe". If there is an
email address in the doc, then it might serve as a clue to where to
look, based on the theory that the contact information would be grouped
together. Another clue might be to look for the word "Author" or its
equivalent in a bunch of languages.

"Tricky" is an understatement.

- Hendrik

--
http://mail.python.org/mailman/listinfo/python-list
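A rough sketch of the clues mentioned in the reply above, in case it helps to see them side by side. Everything here is an assumption made for illustration: the keyword list in `AUTHOR_WORDS`, the `max_gap` and `window` defaults, and the function names `find_clues` and `near_email` are hypothetical, and the result is a set of hints rather than an algorithm.

```python
import re

# Hypothetical keyword list; extend it for the languages you expect.
AUTHOR_WORDS = ("author", "auteur", "autor", "författare")

# Either "mailto:user@host" or "<user@host>".
MAILTO_RE = re.compile(r"mailto:\S+@\S+|<[^<>\s@]+@[^<>\s@]+>")
HAS_NUMBER_RE = re.compile(r"\d")


def find_clues(text, max_gap=4):
    """Collect line numbers (0-based) of the clues, per kind."""
    clues = {"email_lines": [], "author_lines": [], "address_spans": []}
    numbered = []
    for i, line in enumerate(text.splitlines()):
        if MAILTO_RE.search(line):
            clues["email_lines"].append(i)
        elif HAS_NUMBER_RE.search(line):
            # Lines holding an email clue are skipped here, so digits
            # inside an address are not mistaken for a street number.
            numbered.append(i)
        if any(word in line.lower() for word in AUTHOR_WORDS):
            clues["author_lines"].append(i)
    # A numbered line followed at most max_gap lines later by another
    # numbered line is treated as a possible street number / zip code pair.
    for first, second in zip(numbered, numbered[1:]):
        if second - first <= max_gap:
            clues["address_spans"].append((first, second))
    return clues


def near_email(clues, window=5):
    """Keep address spans with an edge within `window` lines of an email clue."""
    return [
        span
        for span in clues["address_spans"]
        if any(abs(e - edge) <= window
               for e in clues["email_lines"] for edge in span)
    ]
```

Used on a document, `find_clues(text)` gives the candidate lines, and `near_email(...)` applies the "contact information is grouped together" theory to filter the address candidates. Dates, page numbers and phone numbers will produce false positives, which is rather the point of calling it tricky.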