Re: Document Classification - indexing question

Bastian Preindl Tue, 08 May 2007 05:37:53 -0700

Hi Armel,

thanks for your quick reply!

I have been working on a similar project for the last couple of months, but I
am taking a slightly different approach. Because fetching, parsing and
indexing can be time consuming, and because in my case I also need the
unclassified indexes, I use a classification algorithm and the Lucene API to
build classified indexes, with the first index as the corpus.

This is definitely a good idea and a somewhat different approach, as it moves
the classification task out of Nutch and into Lucene. Are there any
frameworks or plugins already available for applying document classification
within Lucene? The much faster parsing and indexing in Nutch when no "online"
classification takes place has to be weighed against the disk space
consumption, which is some thousand times greater when indexing all parsed
documents instead of only the positively classified ones.
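
Just to check that I understand the two-pass idea, here is a minimal sketch in
plain Python rather than against the Lucene API. Everything in it is an
assumption for illustration: the list of dicts stands in for the unclassified
index that Nutch builds, and keyword_classifier is a hypothetical toy in place
of a real classification algorithm.

```python
# Sketch only: plain dicts stand in for documents in the Nutch-built index.
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]


def keyword_classifier(keywords: Iterable[str]) -> Callable[[Document], bool]:
    """Hypothetical toy classifier: positive if any keyword is a token of the text."""
    wanted = {k.lower() for k in keywords}

    def classify(doc: Document) -> bool:
        tokens = set(doc["content"].lower().split())
        return bool(wanted & tokens)

    return classify


def build_classified_index(
    corpus: Iterable[Document], classify: Callable[[Document], bool]
) -> List[Document]:
    """Second pass: read the unclassified index and keep only the positives."""
    return [doc for doc in corpus if classify(doc)]


# First pass result: everything that was fetched and parsed (made-up data).
unclassified_index = [
    {"url": "http://example.org/a", "content": "Lucene scoring and ranking"},
    {"url": "http://example.org/b", "content": "Holiday recipes for the family"},
    {"url": "http://example.org/c", "content": "Building a search index with Nutch"},
]

classified_index = build_classified_index(
    unclassified_index, keyword_classifier(["lucene", "nutch"])
)
print([doc["url"] for doc in classified_index])
```

The unclassified index stays on disk in full, which is exactly the space cost
mentioned above, while the classified index can be rebuilt at any time
without fetching or parsing again.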

Maybe we should discuss this together on Skype or MSN; let me know. My Skype
is etapix.

That would be really nice, thanks for the offer! I'll let you know my MSN
number after I've created an account.


Best regards

Bastian
