Hi all,

While chatting with Chris Mattmann, we came to see a clear need for a new sub-project within Lucene.
At the moment, the Lucene sub-projects involved in Nutch are:

1. Lucene-java - The basis for search technology
2. Hadoop - The distributed computing platform
3. Nutch - The search engine that relies on Lucene and Hadoop

Since Nutch contains some value-added pieces of code that focus on content analysis, we think it would be a good idea to split those pieces out of Nutch into a new sub-project dedicated to content analysis.

The components we have identified are:

1. MimeType Repository
2. Language Identifier
3. Content Signature (MD5Signature / TextProfileSignature / ...)
(4. Generic Meta Data Infrastructure)
(5. Charset Detector)
(6. Parse Plugins Framework)

The idea is to expose these pieces of code as a standalone library, since we are convinced they could be useful in many projects other than Nutch. The benefit would be code that is more widely used, tested, and contributed to.

If this proposal is accepted, we have a candidate name for this new project: Tika (it comes from my son ;-) )

Any comments are welcome.

Jérôme
