Hi Julien,

Thanks for the reply. I'm reading about the plugins now.

It is good to know that you have done a project like this. The API you pointed me to looks very useful. I'll try to use it, since it already has a Java interface and the term weighting.

One thing I realized is that I also need to change the way the search works, for example by reading the category from the query string and using it in the search.

For the training corpus, I will choose some categories from DMOZ and take the seeds from there, since those pages are already classified. So the idea is to use part of the DMOZ data as the training corpus and then classify the other crawled pages. I don't know if it will be easy, but it looks like a good project. I'm thinking of using feature selection so that the SVM doesn't have to work with an enormous number of terms.

I saw that there is a Nutch plugin that works with DMOZ data, and it will help a lot (although, as I said before, that plugin isn't used the way I have in mind).

If you have any other suggestions, I'll be very glad to read them :)

Thanks again,
Luan Cestari

On Tue, Jul 6, 2010 at 9:01 AM, Julien Nioche <[email protected]> wrote:

> Hi Cesar,
>
> This can definitely be done using a custom parse plugin and an indexing
> plugin. We did something like this sometime ago to classify adult pages
> using our text classification API (
> http://code.google.com/p/textclassification/) which is based on SVM.
>
> Out of interest, what categories are you planning to use and how will you
> build the training corpus?
>
> HTH
>
> Julien
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
>
> On 6 July 2010 12:51, Luan Cestari <[email protected]> wrote:
>
>> Nutch Developers,
>>
>> I'm at the last year of Computer Science and my graduation project is
>> related to web search. The plan is to add a filter of page's category to
>> Nutch, in a attempt to use SVM to classify the crawled pages.
>>
>> So I ask you: do you think I'll have to change internals of Nutch or can
>> this be done with plugins?
>>
>> Thanks.
>>
>> Best Regards,
>> Luan Cestari

