On Friday 06 November 2009 20:47:00 Grant Ingersoll wrote:
> On Nov 6, 2009, at 5:06 AM, Max Heimel wrote:
> > II. Layer: Preprocessing
> > The data is probably not structured enough to be directly processable
> > by a machine, so it has to be preprocessed. This
> > step could e.g. consist of extracting the blog fulltext from the
> > crawl, stemming it, finding named entities and tagging them.
> > We currently think of using UIMA for this layer.
>
> This could likely be done as M/R jobs too and contributed to Mahout
> utils module if so desired.

+1, though I know of code* at TU for retrieving blog URLs via Yahoo! BOSS and
"guessing" the RSS feed URL. In a first iteration this might be a nice way of
getting around the problem of having to parse the HTML code and separate
blog postings from comments and from navigational code.

Isabel

* That is fine to publish under the Apache Software License according to the
guys at the research group.

Web: <http://www.isabel-drost.de>
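
To make the "guessing" idea above concrete, here is a minimal sketch in Python. It is not the TU code, which is not shown in this thread. The function name `guess_feed_urls` and the list of candidate paths are assumptions chosen for illustration, and a real implementation would still have to fetch each candidate and check that it actually is a feed.

```python
from urllib.parse import urljoin

# Hypothetical list of common feed locations (an assumption for this sketch,
# not taken from the TU code mentioned above).
COMMON_FEED_PATHS = [
    "feed/",
    "rss.xml",
    "atom.xml",
    "index.xml",
    "feeds/posts/default",
]


def guess_feed_urls(blog_url, paths=COMMON_FEED_PATHS):
    """Return candidate feed URLs for a blog home page URL.

    This only builds candidate URLs; it does not touch the network.
    """
    # Ensure a trailing slash so urljoin appends to the blog path
    # instead of replacing its last segment.
    base = blog_url if blog_url.endswith("/") else blog_url + "/"
    return [urljoin(base, path) for path in paths]


if __name__ == "__main__":
    for candidate in guess_feed_urls("http://example.org/blog"):
        print(candidate)
```

Each candidate could then be requested in turn, keeping the first one that parses as RSS or Atom, so that the posting text comes from the feed rather than from the HTML page.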
