Bastian,

I have been working on a similar project for the last couple of months, but I
am taking a slightly different approach. Because fetching, parsing and
indexing can be time-consuming, and because in my case I also need the
unclassified indexes, I index everything first. Then, using a classification
algorithm and the Lucene API, I build the classified indexes from that first
index, which serves as the corpus.
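Roughly, the re-indexing step looks like the sketch below. This is only a
minimal illustration of the idea: the Classifier interface is a placeholder
for whatever trained algorithm gets plugged in, it assumes the Lucene 2.x API
bundled with Nutch, and the "content" field name is an assumption - depending
on your indexing filters the text may not be stored in the index and might
have to be read from the segments instead.

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class ClassifiedIndexBuilder {

  /** Placeholder for the trained algorithm (Bayes, SVM, ...). */
  public interface Classifier {
    String classify(String text);   // returns a category label
  }

  /** Re-indexes an existing (unclassified) index into one index per category. */
  public static void build(String srcIndex, Classifier classifier) throws Exception {
    IndexReader reader = IndexReader.open(srcIndex);
    Map<String, IndexWriter> writers = new HashMap<String, IndexWriter>();
    try {
      for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) continue;            // skip deleted documents
        Document doc = reader.document(i);
        String content = doc.get("content");          // assumed stored field name
        if (content == null) continue;
        String category = classifier.classify(content);
        IndexWriter writer = writers.get(category);
        if (writer == null) {                         // one new index per category
          writer = new IndexWriter("classified/" + category, new StandardAnalyzer(), true);
          writers.put(category, writer);
        }
        writer.addDocument(doc);
      }
    } finally {
      for (IndexWriter w : writers.values()) w.close();
      reader.close();
    }
  }
}

Building one IndexWriter per category keeps the unclassified index untouched,
which is the main reason I went this way.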

Maybe we should discuss this together on Skype or MSN; let me know. My Skype
ID is etapix.

-----Original Message-----
From: Bastian Preindl [mailto:[EMAIL PROTECTED] 
Sent: 08 May 2007 11:31
To: [EMAIL PROTECTED]
Subject: Document Classification - indexing question

Hi,

I'd like to use Nutch for crawling parts of the web and automatically 
classify the fetched documents before indexing them. I've already done 
some investigation on how to achieve this and have read about different 
classification techniques such as Bayes, SVM and so on. I've also run 
some offline classification tests with several libraries, and I think 
the best approach would be a pre-classification (has interesting content / 
doesn't have interesting content) using something similar to CRM114, 
followed by a "fine-grained" multi-class classification of the interesting 
documents using SVM or something like LingPipe.
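In pseudo-Java, the two stages would fit together roughly like the sketch
below; both interfaces are placeholders for whichever libraries end up doing
the actual work, so this is only meant to show the decision flow:

/** Two-stage pipeline: a binary "interesting?" gate followed by a
 *  fine-grained category classifier. Both interfaces are placeholders
 *  for concrete implementations (CRM114-style filter, SVM, LingPipe, ...). */
public class TwoStageClassifier {

  public interface BinaryFilter {          // stage 1: interesting / not interesting
    boolean isInteresting(String text);
  }

  public interface CategoryClassifier {    // stage 2: fine-grained category
    String categorize(String text);
  }

  private final BinaryFilter gate;
  private final CategoryClassifier categorizer;

  public TwoStageClassifier(BinaryFilter gate, CategoryClassifier categorizer) {
    this.gate = gate;
    this.categorizer = categorizer;
  }

  /** Returns a category label, or null if the document should be discarded. */
  public String classify(String text) {
    if (!gate.isInteresting(text)) {
      return null;                          // not interesting: drop it
    }
    return categorizer.categorize(text);    // interesting: attach a category
  }
}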

My question is: which extension point is appropriate for such a plugin or 
extension, and how can I prevent documents that are not interesting from 
being indexed at all?

To illustrate my approach I'd like to apply the following actions 
step-by-step:

1: Fetch a new document from the web
2: Pre-classify the document (interesting, not interesting) with an 
already trained filter/classifier - positive: goto 3, negative: goto 4
3: Classify the interesting document using an already trained classifier 
having multi-classification-capabilities and index it with 
meta-information about the document's class/category, goto 5
4: Throw the document's content and the URL away, forget it, don't index 
it, goto 5
5: Fetch the next document (goto 1)

What are the best points to "hook in" such a classification, and how do I 
tell Nutch to throw a document away completely so that it never gets indexed?
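From what I have seen so far, an IndexingFilter looks like a candidate, so for
illustration here is roughly what I imagine such a filter could look like. It
is only a sketch under two assumptions I have not verified: that the
org.apache.nutch.indexer.IndexingFilter extension point (Nutch 0.9 signature)
is the right hook, and that returning null from a filter keeps the document
out of the index. The classifier calls are placeholders for the trained models.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class ClassifyingIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    String text = parse.getText();           // plain text extracted at parse time

    // Step 2: pre-classification gate (placeholder call)
    if (!isInteresting(text)) {
      return null;                           // step 4: keep the document out of the index
    }

    // Step 3: fine-grained classification, stored as meta-information
    String category = categorize(text);      // placeholder call
    doc.add(new Field("category", category, Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }

  // Placeholders for the trained classifiers (pre-filter and multi-class model).
  private boolean isInteresting(String text) { return true; }
  private String categorize(String text) { return "unknown"; }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

Such a filter would of course still need a plugin.xml descriptor and an entry
in plugin.includes, and it would only keep the document out of the index; the
content would still have been fetched and parsed, so I am not sure it covers
step 4 completely.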

I would be very grateful if somebody could provide some hints on this or, 
even better, a field report on how this can be achieved.

Thank you very much in advance

Bastian



