Hi,

>
>    - Using a modified version of DmozParser to initialize the crawlDB with
>    the URLs AND the top-level classification.
>    - Change the crawler to fetch the pages and include two fields in the
>    webDB:
>       - One field for the classification (taken from the DMOZ file or
>       assigned later by another process)
>       - Another field to indicate whether the page is on the DMOZ list (so
>       that it will serve as the corpus for the SVM training).
>    - After all pages are crawled, a new routine will generate the SVM
>    model by consulting the webDB for the corpus, its classification, and
>    the number of inlinks and outlinks.
>    - After that, we'll use the model to classify the other pages in the
>    WebDB.
>    - Finally, we'll change the search module to read a parameter from the
>    query string that indicates the category (or "nothing" if the user
>    chooses not to filter by category) and use it in the query.
>
> It would probably simplify things a lot if you did the training stage
independently of Nutch. You will certainly want to try different sets of
weights and parameters to assess their impact on the classification
accuracy. Here's how I'd do it:

* get a list of URLs and their DMOZ category
* inject that into the crawlDB, specifying the category as metadata, e.g.
http://www.myurl.com/ label=DMOZ_categ
* fetch and parse in the usual way
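The injection step above can be sketched with a small script that turns a
list of (URL, category) pairs into a seed file: Nutch's injector treats
"key=value" tokens after a URL as per-URL metadata, which is what the
"label=DMOZ_categ" example relies on. The file contents below are
illustrative placeholders, not real DMOZ data.

```python
# Sketch: turn (URL, DMOZ category) pairs into Nutch seed-file lines.
# The injector parses "key=value" tokens following the URL as per-URL
# metadata, so each line becomes "<url>\tlabel=<category>".

def make_seed_lines(url_categories):
    """Return seed-file lines with the category attached as metadata."""
    lines = []
    for url, category in url_categories:
        # Metadata values must not contain spaces; replace them.
        safe_label = category.replace(" ", "_")
        lines.append("%s\tlabel=%s" % (url, safe_label))
    return lines

if __name__ == "__main__":
    pairs = [
        ("http://www.myurl.com/", "DMOZ_categ"),
        ("http://example.org/news/", "Top/News"),
    ]
    for line in make_seed_lines(pairs):
        print(line)
```

Writing the result to a seed directory and running the usual inject command
would then carry the label through the crawlDB.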

The next step would be to extract the documents and generate a training
corpus. IMHO an interesting aspect is that you have semi-structured
information, e.g. titles, anchors, meta descriptions, keywords and, of
course, the text itself. This should definitely give you more options as to
what can be used for generating the features. The TC API can take a
semi-structured representation as input.
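To make the semi-structured point concrete, here is a rough sketch (not the
actual TC API input format, which I won't guess at) of field-aware feature
generation, where each token keeps track of the field it came from:

```python
# Sketch: prefix each token with its field of origin, so that
# "title:football" and "text:football" are distinct features. This is
# the basic idea behind feeding a classifier a semi-structured
# representation instead of one flat bag of words.

def field_features(doc_fields):
    """doc_fields: dict mapping field name -> text. Returns feature list."""
    features = []
    for field, text in doc_fields.items():
        for token in text.lower().split():
            features.append("%s:%s" % (field, token))
    return features
```

For example, `field_features({"title": "Football results"})` yields
`["title:football", "title:results"]`, so a weighting scheme can treat title
matches differently from body matches.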

One option would be to write a custom indexer, so that you benefit from the
fields of the NutchDocument at indexing time and can generate a training
corpus for the ML. You then train and evaluate the model, and once that's
done you can use it to classify new documents in Nutch.

This can be done as an indexing plugin. Basically, the plugin uses the TC
API to generate a new field, which you can then use in SOLR (or in the
Nutch-Lucene index) to search, filter, etc.
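Once that field is in the index, search-time filtering reduces to mapping
the query-string parameter onto a SOLR filter query. As a sketch (the
"category" parameter and field names are made up for illustration; "fq" is
SOLR's standard filter-query parameter):

```python
# Sketch: map an optional "category" query-string value onto SOLR
# request parameters. A real setup would use whatever field name the
# indexing plugin wrote; "category" here is an assumed name.

def build_solr_params(user_query, category=None):
    """Return SOLR query params, adding a filter when a category is set."""
    params = {"q": user_query}
    if category and category != "nothing":
        params["fq"] = "category:%s" % category
    return params
```

So `build_solr_params("football", "Sports")` adds `fq=category:Sports`,
while passing "nothing" (the no-filter choice from the original plan) leaves
the query unfiltered.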

If you haven't done so already, it could be worth having a look at Apache
Mahout as well.

HTH

Julien
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
