Hi Ted,

No, I've been using the FreeGenerator in Nutch, so no linkDB has been built. I suppose the text from anchors could make good features, though. Were you thinking about using the actual links as features?
J.

2008/9/19 Ted Dunning <[EMAIL PROTECTED]>

> Julien,
>
> That sounds great.
>
> Do you record linking information as well?
>
> On Fri, Sep 19, 2008 at 9:42 AM, Julien Nioche <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I am currently working on the classification of pages according to DMOZ :-)
> > I have been planning to give Mahout a serious try but never managed to do it
> > so that could be a good opportunity to do that.
> >
> > We have downloaded and parsed the latest DMOZ snapshot. Everything is
> > currently stored in a DB, we have the following fields for each document:
> > - URL
> > - category (level 1 from DMOZ)
> > - content
> > - title
> > - description (taken from the HTML meta tags)
> > - keywords (taken from the HTML meta tags)
> > - status (unavailable|fetched)
> >
> > We are using our own API to convert the information for each document into a
> > vector with a choice of which weighting scheme to use (tf-idf, frequency,
> > etc...). The weighting takes the fields into account i.e. if using tf.idf
> > the weight of a given term takes into account its frequency in this specific
> > field (say title).
> >
> > I could describe the whole process on a Wiki page but that would be quite
> > long (especially if we need to go through all the details of Nutch), maybe I
> > could simply generate a textual representation of the matrix and put it in a
> > place where people could download it? That could be the starting point of
> > the use case. There would also be a lexicon file containing the mapping
> > between the attribute labels and their index.
> >
> > There could be all sorts of possible experiments from there e.g. trying to
> > see which attributes are the most discriminant etc...
> >
> > Does that make sense?
> >
> > Julien
> >
> > 2008/9/19 Grant Ingersoll <[EMAIL PROTECTED]>
> >
> > > Amazon has generously donated some credits, so I plan on putting Mahout up
> > > and doing some testing. Was wondering if people had suggestions on things
> > > they would like to see from Mahout. For starters, I'm going to put up a
> > > public image containing 0.1 when it's ready, but I'd also like to wiki up
> > > some examples. I.e. go here, get this data, put it in this format and then
> > > do X. We have some simple examples, but I think it would be cool to show
> > > how to do something a bit more complex, like maybe classify web pages
> > > according to DMOZ or to cluster on stuff, or maybe put in a large traveling
> > > salesman problem using the GA stuff Deneche did.
> > >
> > > Thoughts? Anyone else interested in setting up some use cases?
> > >
> > > -Grant
>
> --
> ted

--
DigitalPebble Ltd
http://www.digitalpebble.com
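
The field-aware weighting and lexicon file mentioned in the quoted message could be sketched roughly as follows. This is only an illustration under stated assumptions, not the API actually used for the DMOZ data: the function names `build_field_tfidf` and `to_text`, the `field:term` attribute labels, the whitespace tokenization, and the plain `tf * log(N / df)` formula are all hypothetical choices.

```python
import math
from collections import Counter


def build_field_tfidf(docs, fields):
    """Build a lexicon and sparse tf-idf vectors, one attribute per (field, term).

    docs   -- list of dicts mapping a field name to its text (hypothetical layout)
    fields -- field names to use, e.g. ["title", "content"]
    """
    # Term frequency per document, keyed by "field:term" so that the same
    # word in the title and in the content becomes two separate attributes.
    tf_per_doc = []
    for doc in docs:
        tf = Counter()
        for field in fields:
            for term in doc.get(field, "").lower().split():
                tf[f"{field}:{term}"] += 1
        tf_per_doc.append(tf)

    # Document frequency of each field-specific attribute.
    df = Counter()
    for tf in tf_per_doc:
        df.update(tf.keys())

    # Lexicon: attribute label -> column index (sorted for a stable mapping).
    lexicon = {label: index for index, label in enumerate(sorted(df))}

    n_docs = len(docs)
    vectors = []
    for tf in tf_per_doc:
        vectors.append({
            lexicon[label]: count * math.log(n_docs / df[label])
            for label, count in tf.items()
        })
    return lexicon, vectors


def to_text(vector):
    """Render one sparse vector as 'index:weight' pairs, ordered by index."""
    return " ".join(f"{i}:{w:.4f}" for i, w in sorted(vector.items()))


docs = [
    {"title": "apache mahout", "content": "mahout clustering"},
    {"title": "apache nutch", "content": "nutch crawling"},
]
lexicon, vectors = build_field_tfidf(docs, ["title", "content"])
for vector in vectors:
    print(to_text(vector))
```

Keying attributes by field keeps "mahout" in a title distinct from "mahout" in the body, which is one simple way to let the weight depend on the term's frequency within a specific field. Writing `lexicon` out alongside the printed rows would give the label-to-index mapping next to the textual matrix.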