hi... i'm looking at a problem and i can't figure out how to "easily" solve it...
basically, i'm trying to figure out whether lucene/nutch can be combined with some form of pattern matching to extract course information from the registrar's section of a college website.

assume i can point to the registrar's section of a college site, crawl through it, and capture all the underlying pages, including the course information. is there a way to use pattern matching/recognition to interpret the DOM and pull out the class schedule information?

i'm pretty sure there's no vanilla approach, so i'd even consider a solution where i first have to evaluate/analyze each site and tell the tool which DOM elements are "important".

has anyone done any work/projects like this? are there any research/papers/sample apps i could look at? any thoughts/comments are welcome.

i could brute force this by writing a bunch of perl scripts, with each script tied to a given registrar site, but i'd like a more generalizable solution if one exists.

thanks

-bruce
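
p.s. to make the "tell it which DOM elements are important" idea a bit more concrete, here's a rough python sketch of what i'm picturing: one generic extractor plus a small per-site profile, instead of one script per registrar. everything in it is made up for illustration (the `SITE_PROFILES` dict, the `example-college` key, the `schedule` table class, the column names, the sample html), and it only uses python's standard `html.parser` module, not lucene/nutch, so treat it as a sketch under those assumptions and not a real solution.

```python
from html.parser import HTMLParser

# Hypothetical per-site profile: which table holds the schedule and what
# each column means. In this sketch, a new site needs a new entry here,
# not a new script.
SITE_PROFILES = {
    "example-college": {
        "table_class": "schedule",
        "columns": ["course", "title", "days", "time"],
    },
}

# Made-up sample page, standing in for a crawled registrar page.
SAMPLE_HTML = """
<table class="nav"><tr><td>Home</td></tr></table>
<table class="schedule">
  <tr><th>Course</th><th>Title</th><th>Days</th><th>Time</th></tr>
  <tr><td>CS 101</td><td>Intro to Programming</td><td>MWF</td><td>09:00-09:50</td></tr>
  <tr><td>MATH 220</td><td>Linear Algebra</td><td>TR</td><td>13:00-14:15</td></tr>
</table>
"""


class ScheduleExtractor(HTMLParser):
    """Collects <td> rows from the one table named in the site profile."""

    def __init__(self, profile):
        super().__init__()
        self.profile = profile
        self.in_table = False   # inside the table the profile points at
        self.in_cell = False    # inside a <td> of that table
        self.row = None         # cells of the row being read, or None
        self.rows = []          # finished rows as dicts

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "table" and self.profile["table_class"] in classes:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.row = []
        elif self.in_table and tag == "td" and self.row is not None:
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.in_table:
            columns = self.profile["columns"]
            # Keep only rows whose cell count matches the profile;
            # header rows made of <th> cells end up empty and are skipped.
            if self.row and len(self.row) == len(columns):
                cells = [cell.strip() for cell in self.row]
                self.rows.append(dict(zip(columns, cells)))
            self.row = None
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and self.row:
            self.row[-1] += data


def extract_schedule(html, site):
    """Run the generic extractor with the profile registered for `site`."""
    parser = ScheduleExtractor(SITE_PROFILES[site])
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for entry in extract_schedule(SAMPLE_HTML, "example-college"):
        print(entry)
```

the part i don't know how to generalize is building the profile itself: here it's written by hand after looking at the site, which is exactly the "initially evaluate/analyze the site" step i'd like to shrink or automate.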