"James A. Treacy" <[EMAIL PROTECTED]> writes:

> On Mon, Mar 20, 2000 at 11:31:52AM -0700, Greg McGary wrote:
>
> > > Files are of the form foo.lang.html, e.g. index.en.html.
> >
> > OK.  That makes it very easy.  What's the complete list of languages,
> > and what charset encoding is used for each?  I'm a lowly mono-lingual
> > ugly American, but I have a brain-trust of i18n pros, so I'll get them
> > to help me figure out how best to code language-specific scanners.
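(As an aside, here is the flavor of thing I mean by a language-specific
scanner.  This is only a toy Python sketch for illustration -- not
mkid's actual implementation -- and the `is_word_char` predicate stands
in for whatever per-language, per-charset character-class table we end
up needing:)

```python
def tokenize(text, is_word_char=None):
    """Split text into word tokens by character-class transitions.

    A token ends whenever the scanner crosses from word characters
    to non-word characters; runs of whitespace and punctuation are
    tossed rather than kept.
    """
    if is_word_char is None:
        # Default word class: letters, digits, underscore.
        is_word_char = lambda c: c.isalnum() or c == "_"
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("foo.lang.html, e.g. index.en.html"))
# ['foo', 'lang', 'html', 'e', 'g', 'index', 'en', 'html']
```

The per-language question is then just: which predicate do we plug in,
and does it depend on the charset?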
> Is it really necessary to know the charset used on the page?  As long
> as searches are 8 bit clean I would think that it wouldn't make a
> difference.

The issue is: how does one delimit tokens?  You need to know the
character classes in order to know when you have transitioned from one
character class to another, and therefore ought to end the current
token.  You then need to know which sequences of character classes to
keep and which to toss (keep "words", toss sequences of whitespace and
punctuation).  I suppose if the non-word character classes (e.g.,
whitespace, punctuation) are consistent across all languages and
charsets, then you can treat everything that's not a non-word as a
word and be done with it.  I don't know enough about the subject to
judge.

> Unless you are interested in creating a general purpose cgi frontend
> it is probably better if you work on the searching/indexing and
> specialized parsers while we create the cgi interface.

That's fine by me.  The less I have to do the better.  8^)  I'm sure I
can mold the id-utils query interface to be whatever you like.

> In a separate mail you asked for help in creating the html parser (I
> hope my terminology is correct).

I call mkid's token gatherers "scanners" rather than "parsers", in
order to emphasize their simple-minded lexical nature and fast
execution.  Since they must pick out keywords, they do a little
parsing, but it's nothing close to the sophistication or overhead of,
say, a context-free grammar.

> I'd love to help, but am already overextended. :(

I think Craig is going to give that a go.  We've been discussing it
offline already.  If you can handle the cgi frontend, that's plenty
useful.

> With respect to parsers, do you have a suggestion on the best way to
> handle the list archives?  We currently generate a single file for
> each list for each month (in standard mail format - some as big as
> 10MB).  Each file is then broken up into a directory containing one
> htmlized file for each piece of mail.
> This generates a LOT of files.  Do you think it would be practical
> (from a speed point of view) to work directly from the big files and
> extract the relevant mails on the fly?

Don't queries need to return the html file names, since that's what the
users will see?  The users never see the 10MB monthly files, do they?
Assuming the html files are what we want to index, we should just index
them directly, with no fancy footwork to save the open(2) system calls.
Email archives are index-once-and-for-all things, especially since mkid
can build incrementally.  The development time of teaching the scanner
how to scan a large file yet emit index entries as though it had
scanned the individual html files doesn't seem worth the trouble.

Greg
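P.S.  To make the "just index the html files" point concrete, here is a
toy inverted index in Python.  This is purely illustrative -- the real
thing would be mkid writing an ID database, and the directory layout,
tag stripping, and word class here are all made up for the sketch:

```python
import os
import re
from collections import defaultdict

def build_index(html_dir):
    """Toy inverted index: token -> set of html file names.

    Queries return the html file names directly -- the files the
    users actually see -- with no need to re-read the big monthly
    mailbox files at all.
    """
    index = defaultdict(set)
    tag = re.compile(r"<[^>]+>")          # crude html tag stripper
    word = re.compile(r"[A-Za-z0-9_]+")   # crude word character class
    for name in os.listdir(html_dir):
        if not name.endswith(".html"):
            continue
        path = os.path.join(html_dir, name)
        with open(path, encoding="latin-1") as f:
            text = tag.sub(" ", f.read())
        for token in word.findall(text):
            index[token.lower()].add(name)
    return index
```

A query is then just a dictionary lookup that hands the cgi frontend
the html file names to serve.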

