Julien and guys from nutch-dev,
Sorry for asking about text classification on this Nutch mailing list, but
this mail is related to Nutch.
We would be very glad if you could take a look at the following
steps. We're planning to accomplish the classification by:
- Using a modified version of DmozParser to initialize the crawlDB with the
URLs AND the top-level classification.
- Changing the crawler to fetch the pages and add two fields to
the webDB:
   - One field for the classification (taken from the DMOZ file, or assigned
later by another process).
   - Another field indicating whether the page is on the DMOZ list (so that it
will form the corpus for the SVM training).
- After all pages are crawled, a new routine will generate the SVM model
by consulting the webDB for the corpus, its classification, and the number of
inlinks and outlinks.
- After that, we'll use the model to classify the other pages in the webDB.
- Finally, we'll change the search module to read a parameter from the
query string that indicates the category (or nothing, if the user chooses
not to filter by category) and use it in the query.
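To make the last two steps concrete, here is a minimal, Nutch-independent sketch in Java of what we have in mind. The `PageRecord` class, its field names, and the `filter` method are hypothetical stand-ins for the two extra webDB fields and the query-time category filter; they are not actual Nutch APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a webDB entry carrying the two extra
// fields proposed above: a category label and an in-DMOZ flag.
class PageRecord {
    final String url;
    final String category;  // from DMOZ, or assigned later by the SVM model
    final boolean inDmoz;   // true => part of the SVM training corpus

    PageRecord(String url, String category, boolean inDmoz) {
        this.url = url;
        this.category = category;
        this.inDmoz = inDmoz;
    }
}

public class CategoryFilterSketch {
    // Query-time filtering: keep only results whose category matches the
    // query-string parameter, or all results when no category was given.
    static List<PageRecord> filter(List<PageRecord> results, String category) {
        List<PageRecord> out = new ArrayList<>();
        for (PageRecord r : results) {
            if (category == null || category.equals(r.category)) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<PageRecord> db = new ArrayList<>();
        db.add(new PageRecord("http://a.example/", "Arts", true));
        db.add(new PageRecord("http://b.example/", "Science", false));

        System.out.println(filter(db, "Arts").size()); // category filter applied
        System.out.println(filter(db, null).size());   // no filter: all results
    }
}
```

In the real system the filtering would of course happen inside the search module against the index rather than over an in-memory list; this only illustrates the intended query-parameter semantics.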
Any opinions, suggestions, and tips would help us a lot. We appreciate your
attention.
Thanks.
Best Regards,
Luan Cestari & Daniel Gimenes