Hal Daume III wrote:

> p.s., certainly this is at least somewhat unique to me, but almost all of
> the data i work with is unstructured text for two reasons. first, that's
> how it naturally comes. second, to throw xml or some other scheme on to
> it will balloon the data sizes to unmanagable amounts, with little gain.
There's a pretty big gap between *unstructured* text and e.g. XML. Most
of what fits into that gap is essentially structured text. If you're
performing some kind of processing on the text, the odds are that it
does actually have some degree of structure to it.

My experience of code which does ad-hoc text processing using regexps
or similar is that a lot of it only handles a subset of what it ought
to, and that subset is typically defined by the nature of the
technique. Some examples of this issue are code which attempts to:

+ match C-style string literals, but falls down on an embedded \"
  sequence;
+ match code tokens, but matches the same sequence of characters when
  they occur inside string literals;
+ process email headers, but falls down on folded headers;
+ process HTML, but falls down in more ways than I could possibly list.

Except in the most trivial cases, to process text *reliably* you
usually need to at least tokenise it and process the token stream (see
the sketch in the p.s. below). And code which handles anything with a
more complex structure usually needs to operate (at least conceptually)
on a parse tree.

Regexps certainly have their place, although that's primarily in
writing tokenisers. IMHO, trying to do everything (or, at least, too
much) using s/pattern/replacement/ constructs seems to be a favourite
recipe for buggy code. Case in point: the regular occurrence of
cross-site scripting, SQL injection, printf() and similar issues on
lists such as BugTraq.

--
Glynn Clements <[EMAIL PROTECTED]>
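
p.s. To make the tokenise-first point concrete, here is a minimal
sketch of a hand-rolled tokeniser for a C-like fragment (the Token type
and the function names here are made up for illustration, and real code
would want proper error handling). Because a string literal is consumed
as a single token, with an embedded \" handled by the escape case, a
later pass over the token stream can never mistake characters inside a
literal for code tokens:

    import Data.Char (isAlpha, isAlphaNum, isSpace)

    data Token = Ident String       -- identifier
               | StrLit String      -- string literal contents, escapes resolved
               | Sym Char           -- any other single character
      deriving Show

    tokenise :: String -> [Token]
    tokenise [] = []
    tokenise (c:cs)
      | isSpace c = tokenise cs
      | isAlpha c = let (name, rest) = span isAlphaNum (c:cs)
                    in Ident name : tokenise rest
      | c == '"'  = let (lit, rest) = strLit cs
                    in StrLit lit : tokenise rest
      | otherwise = Sym c : tokenise cs

    -- Consume the body of a string literal; an embedded \" does not
    -- terminate it, which is exactly where naive regexps fall down.
    strLit :: String -> (String, String)
    strLit ('\\':e:cs) = let (lit, rest) = strLit cs
                         in (unescape e : lit, rest)
    strLit ('"':cs)    = ("", cs)
    strLit (c:cs)      = let (lit, rest) = strLit cs
                         in (c : lit, rest)
    strLit []          = ("", [])  -- unterminated; real code should report an error

    unescape :: Char -> Char
    unescape 'n' = '\n'
    unescape 't' = '\t'
    unescape e   = e               -- covers \" and \\, among others

So, for example,

    tokenise "printf(\"he said \\\"hi\\\"\");"

yields

    [Ident "printf", Sym '(', StrLit "he said \"hi\"", Sym ')', Sym ';']

rather than spurious Ident tokens for the words inside the literal.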