[Nutch-dev] [Proposal] New Lucene sub-project

Jérôme Charron Fri, 07 Apr 2006 01:29:16 -0700

Hi all,

While chatting with Chris Mattmann, it became evident to us that there
is a need for a new sub-project within Lucene.


For now, the Lucene sub-projects used in Nutch are:
1. Lucene-java - The basis for search technology
2. Hadoop - The distributed computing platform
3. Nutch - The search engine that relies on Lucene and Hadoop.

Since Nutch contains some value-added pieces of code that focus on content
analysis, we think it would be a good idea to split part of Nutch off into a
new sub-project focused on content analysis and manipulation. The components
we have identified are:

1. MimeType Repository
2. Language Identifier
3. Content Signature (MD5Signature / TextProfileSignature / ...)
(4. Generic Meta Data Infrastructure)
(5. Charset Detector)
(6. Parse Plugins Framework)

The idea is to expose these pieces of code as a standalone library, since we
are convinced they could be useful in many projects other than Nutch.
The benefit would be code that is more widely used, tested, and
contributed to.
If this proposal is accepted, we have a candidate name for this new project:
Tika (the name comes from my son ;-) )

Any comments are welcome.

Jérôme
