Hi Ted,

No, I've been using the FreeGenerator in Nutch so no linkDB has been built.
I suppose the text from anchors could make good features though.
Were you thinking about using the actual links as features?

J.

2008/9/19 Ted Dunning <[EMAIL PROTECTED]>

> Julien,
>
> That sounds great.
>
> Do you record linking information as well?
>
> On Fri, Sep 19, 2008 at 9:42 AM, Julien Nioche <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I am currently working on the classification of pages according to
> > DMOZ :-) I have been planning to give Mahout a serious try but never
> > managed to do it so that could be a good opportunity to do that.
> >
> > We have downloaded and parsed the latest DMOZ snapshot. Everything is
> > currently stored in a DB, we have the following fields for each document:
> > - URL
> > - category (level 1 from DMOZ)
> > - content
> > - title
> > - description (taken from the HTML meta tags)
> > - keywords (taken from the HTML meta tags)
> > - status (unavailable|fetched)
> >
> > We are using our own API to convert the information for each document
> > into a vector with a choice of which weighting scheme to use (tf-idf,
> > frequency, etc...). The weighting takes the fields into account i.e.
> > if using tf.idf the weight of a given term takes into account its
> > frequency in this specific field (say title).
> >
> > I could describe the whole process on a Wiki page but that would be
> > quite long (especially if we need to go through all the details of
> > Nutch), maybe I could simply generate a textual representation of the
> > matrix and put it in a place where people could download it? That
> > could be the starting point of the use case. There would also be a
> > lexicon file containing the mapping between the attribute labels and
> > their index.
> >
> > There could be all sorts of possible experiments from there e.g.
> > trying to see which attributes are the most discriminant etc...
> >
> > Does that make sense?
> >
> > Julien
> >
> >
> > 2008/9/19 Grant Ingersoll <[EMAIL PROTECTED]>
> >
> > > Amazon has generously donated some credits, so I plan on putting
> > > Mahout up and doing some testing.  Was wondering if people had
> > > suggestions on things they would like to see from Mahout.  For
> > > starters, I'm going to put up a public image containing 0.1 when
> > > it's ready, but I'd also like to wiki up some examples.  I.e. go
> > > here, get this data, put it in this format and then do X.  We have
> > > some simple examples, but I think it would be cool to show how to
> > > do something a bit more complex, like maybe classify web pages
> > > according to DMOZ or to cluster on stuff, or maybe put in a large
> > > traveling salesman problem using the GA stuff Deneche did.
> > >
> > > Thoughts?  Anyone else interested in setting up some use cases?
> > >
> > > -Grant
> > >
> >
>
>
>
> --
> ted
>
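
To make the field-aware weighting and the matrix/lexicon files described in
the quoted message a bit more concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions, not the API mentioned
above: the sample documents, the "field:term" feature labels, the plain
log(N/df) idf, and the "category index:weight" line layout are all made up
for the example.

```python
import math
from collections import Counter

# Hypothetical sample documents; field names follow the list in the quoted mail.
DOCS = [
    {"category": "Arts", "title": "jazz music festival",
     "content": "live jazz music and concerts"},
    {"category": "Science", "title": "physics news",
     "content": "quantum physics research and music of the spheres"},
]
FIELDS = ("title", "content")  # assumption: only these two fields are used


def field_terms(doc):
    """Count terms per field, labelling each feature as 'field:term'."""
    counts = Counter()
    for field in FIELDS:
        for token in doc.get(field, "").lower().split():
            counts[f"{field}:{token}"] += 1
    return counts


def build_matrix(docs):
    """Return (lexicon, rows); each row is (category, {index: tf-idf weight})."""
    per_doc = [field_terms(d) for d in docs]
    # Document frequency is counted per field-specific feature, so the same
    # word in the title and in the content is weighted independently.
    df = Counter()
    for counts in per_doc:
        df.update(counts.keys())
    lexicon = {label: i for i, label in enumerate(sorted(df))}
    n = len(docs)
    rows = []
    for doc, counts in zip(docs, per_doc):
        weights = {lexicon[label]: tf * math.log(n / df[label])
                   for label, tf in counts.items()}
        rows.append((doc["category"], weights))
    return lexicon, rows


def format_row(category, weights):
    """Render one document as 'category index:weight ...', skipping zeros."""
    cells = " ".join(f"{i}:{w:.4f}" for i, w in sorted(weights.items()) if w > 0)
    return f"{category} {cells}".rstrip()


def format_lexicon(lexicon):
    """Render the lexicon as 'index<TAB>label' lines, ordered by index."""
    return "\n".join(f"{i}\t{label}"
                     for label, i in sorted(lexicon.items(), key=lambda kv: kv[1]))
```

Calling build_matrix(DOCS) and then writing format_row(...) for each row to
one file and format_lexicon(...) to another gives the two downloadable files.
Keeping the field in the feature label is one simple way to let later
experiments compare how discriminant each field is.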



-- 
DigitalPebble Ltd
http://www.digitalpebble.com
