lucene in combination with pattern recognition...

bruce Tue, 20 Jun 2006 07:38:25 -0700

hi...

i'm looking at a problem and i can't figure out how to "easily" solve it...


basically, i'm trying to figure out if there's a way to use lucene/nutch
with some form of pattern matching to extract course information from a
College/Registrar's course section...

Assume I can point to a Registrar's section of a College site.
Assume I can then crawl through the section, and capture
 all the underlying information, including the Course
 information...
Is there a way to use pattern matching/recognition
 to interpret the DOM and pull out the class schedule
 information? I'm pretty sure there's no vanilla approach,
 so I'd even consider some kind of solution where I might
 have to initially evaluate/analyze the site, to tell it
 which DOM elements are "important"...
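
To make the "tell it which DOM elements are important" idea concrete, here is a
minimal sketch in Python (standard library only). It is not a Lucene or Nutch
feature. Everything in it is an assumption made for illustration: the names
TableRowExtractor, extract_courses and SITE_CONFIG, the sample HTML, and the
premise that a site keeps its schedule in a single table with a known id whose
cells have proper closing tags. The per-site knowledge lives in a small config
rather than in a per-site script.

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the cell text of each row in the table whose id is table_id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0          # 1 inside the target table, >1 inside nested tables
        self.in_cell = False
        self.rows = []
        self._row = None
        self._cell = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth > 0:
                self.depth += 1
            elif dict(attrs).get("id") == self.table_id:
                self.depth = 1
        elif self.depth == 1 and tag == "tr":
            self._row = []
        elif self.depth == 1 and tag in ("td", "th") and self._row is not None:
            self.in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
        elif self.depth == 1 and tag in ("td", "th") and self.in_cell:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self.in_cell = False
        elif self.depth == 1 and tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self.in_cell:
            self._cell.append(data)


def extract_courses(html, site_config):
    """Map table rows to dicts using the column names from the site config."""
    parser = TableRowExtractor(site_config["table_id"])
    parser.feed(html)
    parser.close()
    fields = site_config["columns"]
    rows = parser.rows[1:] if site_config.get("skip_header", True) else parser.rows
    # Rows with an unexpected number of cells are skipped, not guessed at.
    return [dict(zip(fields, row)) for row in rows if len(row) == len(fields)]


# Hypothetical per-site description: which table matters and what its columns mean.
SITE_CONFIG = {"table_id": "schedule", "columns": ["course", "title", "days", "time"]}

# Made-up page used only to exercise the sketch.
SAMPLE_HTML = """
<html><body>
<table id="nav"><tr><td>Home</td></tr></table>
<table id="schedule">
  <tr><th>Course</th><th>Title</th><th>Days</th><th>Time</th></tr>
  <tr><td>CS 101</td><td>Intro to  Programming</td><td>MWF</td><td>9:00-9:50</td></tr>
  <tr><td>MATH 201</td><td>Linear Algebra</td><td>TTh</td><td>13:00-14:15</td></tr>
</table>
</body></html>
"""

if __name__ == "__main__":
    for course in extract_courses(SAMPLE_HTML, SITE_CONFIG):
        print(course)
```

With this shape, supporting another registrar site would mean writing a new
config entry instead of a new script, as long as that site fits the
one-table-with-an-id assumption. Sites that lay out schedules differently
would need a richer description than this sketch allows.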

anyone done any work/projects like this...
any research/papers/sample apps i could look at...
any thoughts/comments/etc....

i could brute force this by writing a bunch of perl
scripts, with each script tied to a given registrar site,
but i'd like a more generalizable solution if one exists..

thanks

-bruce


