Bastian,

When trying to classify documents with the dynamic classification
approach, Nutch can take a while to parse the data, depending on the
file type. While working with Nutch I have run into some
NullPointerExceptions caused by the parsing processes. The reason is a
Hadoop setting that is not exposed in the nutch-default.xml file. The
setting should let Nutch increase the time Hadoop waits before marking
a task as inactive.
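
For reference, the override would go into nutch-site.xml. I believe the
property in question is mapred.task.timeout, but please treat both the
name and the value below as assumptions to be checked against your
Hadoop version:

  <property>
    <name>mapred.task.timeout</name>
    <!-- Assumed setting: milliseconds Hadoop waits for a task to report
         progress before killing it (0 disables the timeout). Raised here
         from the usual 600000 (10 minutes) to 30 minutes for slow parses. -->
    <value>1800000</value>
  </property>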

Some questions you should investigate: how will your classification
process handle failed parses, and what happens if the data cannot be
parsed into a text format (i.e. an unsupported file type)? What happens
to the index being built if the classification fails; does it become
corrupted? In a multithreaded environment such as Nutch, what happens
with concurrent classification processes; can their data get mixed up?
I currently have a problem with Nutch: it does not seem able to
generate dynamic fields per document when using more than a single
thread. The index becomes corrupted, with data from different files
ending up in the wrong Lucene document. Many more questions will come
up once you start working on your classification project.
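
To illustrate the kind of race I mean, here is a minimal Java sketch.
The class and method names are made up and this is not the real Nutch
plugin interface; it only shows how a shared mutable field can mix
categories across documents, while per-call local state cannot (Lucene
2.x Field API assumed):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Hypothetical filter, not the actual Nutch IndexingFilter interface.
public class ClassifyingFilter {

    // BUG: one mutable field shared by all indexing threads; thread A's
    // category can end up in the document thread B is currently building.
    private String lastCategory;

    public void filterShared(Document doc, String text) {
        lastCategory = classify(text);                   // shared state
        doc.add(new Field("category", lastCategory,
                Field.Store.YES, Field.Index.UN_TOKENIZED));
    }

    // Safer: keep the result in a local variable so each thread touches
    // only its own document and its own classification result.
    public void filterLocal(Document doc, String text) {
        String category = classify(text);                // local state
        doc.add(new Field("category", category,
                Field.Store.YES, Field.Index.UN_TOKENIZED));
    }

    // Placeholder for whatever classification algorithm is used.
    private String classify(String text) {
        return text.length() > 1000 ? "long" : "short";
    }
}

If your plugin keeps any per-document data in instance fields, that is
the first place I would look.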

Best regards

Armel

-----Original Message-----
From: Bastian Preindl [mailto:[EMAIL PROTECTED] 
Sent: 08 May 2007 13:38
To: [EMAIL PROTECTED]
Subject: Re: Document Classification - indexing question

Hi Armel,

thanks for your quick reply!

> I have been working on a similar project for the last couple of months
> but I am taking a slightly different approach. Because fetching -
> parsing - indexing can be time consuming and in my case, I also need
> the unclassified indexes. Using classification algorithm and the
> Lucene API, I build classified indexes by using the first index as
> corpus.
> 

This is definitely a good idea and a somewhat different approach, as it
moves the classification task out of Nutch and into Lucene. Are there
any frameworks/plugins already available for applying document
classification within Lucene? The much faster parsing and indexing
process within Nutch when no "online" classification takes place stands
against the disk space consumption, which is some thousand times
greater when indexing all parsed documents instead of only the
positively classified ones.
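
Just to make sure I understand the approach, is it roughly along these
lines? A rough sketch against the Lucene 2.x API; isRelevant() stands
in for the real classifier, the index paths are made up, and it assumes
the "content" field is actually stored in the first index (which Nutch
does not do by default):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class IndexClassifier {
    public static void main(String[] args) throws IOException {
        // Unclassified index produced by the normal fetch/parse/index run.
        IndexReader reader = IndexReader.open("crawl/index");
        // Second index holding only the positively classified documents.
        IndexWriter writer = new IndexWriter("crawl/index-classified",
                new StandardAnalyzer(), true);
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;
            Document doc = reader.document(i);
            String content = doc.get("content"); // needs a stored field
            if (content != null && isRelevant(content)) {
                writer.addDocument(doc);
            }
        }
        writer.optimize();
        writer.close();
        reader.close();
    }

    // Placeholder for the real classification algorithm.
    private static boolean isRelevant(String text) {
        return text.indexOf("nutch") >= 0;
    }
}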

> Maybe we should discuss together on skype or MSN let me know. My skype is
> etapix.
>   

That would be really nice, thanks for the offer! I'll let you know my
MSN number after I've created an account.

Best regards

Bastian



