Julien and guys from nutch-dev,
Sorry for asking about text classification on this Nutch mailing list, but
this mail is related to Nutch.
We would be very glad if you could take a look at the following
steps. We're planning to accomplish the classification by:
- Using a modified version of DmozParser to initialize the crawlDB with the
URLs AND the top-level classification.
- Changing the crawler to fetch the pages and add two fields to
the webDB:
   - One field for the classification (taken from the DMOZ file, or assigned
later by another process).
   - Another field indicating whether the page is on the DMOZ list (so that it
will form the corpus for the SVM training).
- After all pages are crawled, a new routine will generate the SVM model
by consulting the webDB for the corpus, its classification, and the number of
inlinks and outlinks.
- After that, we'll use the model to classify the other pages in the webDB.
- Finally, we'll change the search module to read a parameter from the
query string that indicates the category (or nothing, if the user chooses
not to filter by category) and use it in the query.
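To make the last two steps concrete, here is a minimal, Nutch-independent sketch in Java of what we have in mind. The `PageRecord` class, its field names, and the `filter` method are hypothetical stand-ins for the two extra webDB fields and the query-time category filter; they are not actual Nutch APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a webDB entry carrying the two extra
// fields proposed above: a category label and an in-DMOZ flag.
class PageRecord {
    final String url;
    final String category;  // from DMOZ, or assigned later by the SVM model
    final boolean inDmoz;   // true => part of the SVM training corpus

    PageRecord(String url, String category, boolean inDmoz) {
        this.url = url;
        this.category = category;
        this.inDmoz = inDmoz;
    }
}

public class CategoryFilterSketch {
    // Query-time filtering: keep only results whose category matches the
    // query-string parameter, or all results when no category was given.
    static List<PageRecord> filter(List<PageRecord> results, String category) {
        List<PageRecord> out = new ArrayList<>();
        for (PageRecord r : results) {
            if (category == null || category.equals(r.category)) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<PageRecord> db = new ArrayList<>();
        db.add(new PageRecord("http://a.example/", "Arts", true));
        db.add(new PageRecord("http://b.example/", "Science", false));

        System.out.println(filter(db, "Arts").size()); // category filter applied
        System.out.println(filter(db, null).size());   // no filter: all results
    }
}
```

In the real system the filtering would of course happen inside the search module against the index rather than over an in-memory list; this only illustrates the intended query-parameter semantics.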
Any opinions, suggestions, and tips would help us a lot. We appreciate your
attention.
Thanks.
Best Regards,
Luan Cestari & Daniel Gimenes