I also think it makes sense -- we use a language identifier component in
Carrot2 and we'd love to have a single library for this functionality.
As always, some extra managerial effort is unfortunately needed to drive
a stand-alone project.
D.
Chris Mattmann wrote:
Hi Otis,
This thread seems to have gotten very little attention.
Jérôme - I'm all for extracting sub-libraries that can really live on their
own and are substantial enough to warrant "their own identity".
Personally, I'm most interested in the Language Identifier plugin becoming
a standalone, Nutch-independent piece. Doug had suggested we move it to
Lucene's contrib section. If you think it makes sense to have some of
these things lumped together, that's fine, too. It looks like the Language
Identifier and the Charset Detector may go well together.
Is this something you want/will push for and make happen?
Just to add to this, it's something that I would push for wholeheartedly.
In addition to Jérôme, I would be happy to dedicate time to this
sub-project, and I feel it's quite worthy of being its own stand-alone
library.
Just my two cents, thanks!
Cheers,
Chris
Otis
----- Original Message ----
From: Jérôme Charron <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, April 7, 2006 4:26:54 AM
Subject: [Proposal] New Lucene sub-project
Hi all,
While chatting with Chris Mattmann, it became evident to us that there
is a need for a new sub-project within Lucene.
For now, Lucene's sub-projects used in Nutch are:
1. Lucene-java - The basis for search technology
2. Hadoop - The distributed computing platform
3. Nutch - The search engine that relies on Lucene and Hadoop.
Since Nutch contains some value-added pieces of code that focus on content
analysis, we think it would be a good idea to split these pieces out of
Nutch into a new sub-project dedicated to content analysis and
manipulation. The components we have identified are:
1. MimeType Repository
2. Language Identifier
3. Content Signature (MD5Signature / TextProfileSignature / ...)
(4. Generic Meta Data Infrastructure)
(5. Charset Detector)
(6. Parse Plugins Framework)
The idea is to expose these pieces of code in a standalone library, since
we are convinced they could be useful in many projects other than Nutch.
The benefit would be to have this code more widely used, tested, and
contributed to.
If this proposal is accepted, we have a candidate name for this new
project: Tika (comes from my son ;-) )
Any comment is welcome.
Jérôme
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers