Kirby Bohling wrote:
I think you're correct about it being worthwhile. I've got a Git
repository that I use for my work; I'll see about setting up a GitHub
repository and start using that as a public place to put some of my stuff
so you can see it. Unfortunately, I have some proprietary stuff that I
can't contribute back (most of which you don't want anyway). I do have
bugfixes for core issues that I do have permission to contribute.
It'd be much easier for me to use Git to migrate the work back and
forth between work and there. It's also much smoother for me to
develop a series of "easy to review" patches using it.
This is ok at this early stage - although sooner or later the patches
need to appear in JIRA and be submitted with a grant of the ASL.
I'm guessing that Tika isn't ready for this. Given that it's an
Apache and/or Lucene project, it can probably be addressed. My guess
is that a number of the libraries it depends upon won't be.
I think we would like Tika to function as an OSGi plugin (or a group of
plugins?) out of the box so that we could avoid having to wrap it ourselves.
I think Tika as one plugin would lead to a charge of "bloat", given
all the formats it currently supports that you now ship as plugins.
The cumulative weight of our plugins is also significant.
Long term, do you see Nutch just supporting everything Tika does "out
of the box" and including all of the dependencies, thus folding most
of the parser plug-ins into one? My understanding is that Tika is
nothing more than a port of the Nutch library into a single, unified,
reusable library. We might need help/support from Tika if the
answer is to split them up.
IMHO it would be good to include all parsers, but provide a mechanism
for a la carte configuration of active parsers, and a mechanism for
using other parsers packaged as OSGi plugins instead of the Tika ones.
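
The "a la carte" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `ParserSelection` class, its `Parser` interface, and the parser names are hypothetical and are not Tika's or Nutch's actual API. It shows bundled parsers that can be switched off per MIME type, and externally supplied parsers (for example, ones packaged as OSGi plugins) taking precedence over the bundled ones.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class ParserSelection {

    /** Hypothetical minimal parser contract; NOT the real Tika or Nutch interface. */
    public interface Parser {
        String name();
    }

    // Parsers shipped by default (the "include all parsers" part).
    private final Map<String, Parser> bundled = new LinkedHashMap<>();
    // Parsers supplied from outside, e.g. by a separate plugin.
    private final Map<String, Parser> overrides = new LinkedHashMap<>();
    // MIME types the administrator has switched off.
    private final Set<String> disabled = new HashSet<>();

    public void registerBundled(String mimeType, Parser parser) {
        bundled.put(mimeType, parser);
    }

    public void registerOverride(String mimeType, Parser parser) {
        overrides.put(mimeType, parser);
    }

    public void disable(String mimeType) {
        disabled.add(mimeType);
    }

    /** Returns the active parser for a MIME type: an override wins over a bundled parser. */
    public Optional<Parser> resolve(String mimeType) {
        if (disabled.contains(mimeType)) {
            return Optional.empty();
        }
        Parser parser = overrides.get(mimeType);
        if (parser == null) {
            parser = bundled.get(mimeType);
        }
        return Optional.ofNullable(parser);
    }

    public static void main(String[] args) {
        ParserSelection selection = new ParserSelection();
        selection.registerBundled("application/pdf", () -> "bundled-pdf");
        selection.registerBundled("application/msword", () -> "bundled-msword");
        selection.registerOverride("application/pdf", () -> "plugin-pdf");
        selection.disable("application/msword");

        System.out.println(selection.resolve("application/pdf").map(Parser::name));
        System.out.println(selection.resolve("application/msword").map(Parser::name));
    }
}
```

In a real OSGi setup the override map would more likely be fed by the service registry than by explicit calls; the explicit registration here just keeps the sketch self-contained.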
I'd love to help. I've mostly fought along the edges of this problem,
rather than worked on it directly. I've written an OSGi service or
two, but I'm not sure they correctly handled all of the lifecycle issues
and other critical details.
I've played with your current system, and I know you'll have problems
with OSGi, pretty much straight out of the box. I wanted a docx
parser, so I upgraded to Tika 0.3 and packaged the latest POI jars in
a new plug-in, and I had pretty much exactly the problem I described
with Class.forName() with the current plug-in system, because Tika
uses Class.forName(). Tika was in the core class-loader, and the
classes I needed were only in my docx plugin (core can't see system
plugins), so Tika 0.3 couldn't find them. I also have a couple of
small bug fixes for core in the API that it'd be nice to see
integrated; then we could at least upgrade to Tika 0.3.
Tika is already at 0.4; maybe some things have changed.
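
To make the Class.forName() visibility problem described above concrete, here is a minimal sketch. It does not use Tika or the Nutch plug-in system: `LoaderVisibilityDemo`, `DocxParser`, `tryLoad`, and `coreLikeLoader` are hypothetical names, and a bootstrap-only class loader merely stands in for "core", which cannot see classes defined by a loader that is not one of its ancestors. The loader that defined the stand-in parser class plays the role of the plugin loader.

```java
import java.util.Optional;

public class LoaderVisibilityDemo {

    /** Hypothetical stand-in for a parser class that exists only in a plugin. */
    public static class DocxParser {}

    /** Tries to resolve a class through the given loader; empty if that loader cannot see it. */
    public static Optional<Class<?>> tryLoad(String className, ClassLoader loader) {
        try {
            // The three-argument form lets the caller choose the loader explicitly,
            // instead of relying on the loader of the calling class.
            return Optional.of(Class.forName(className, false, loader));
        } catch (ClassNotFoundException e) {
            return Optional.empty();
        }
    }

    /** A loader whose only parent is the bootstrap loader; it plays the role of "core". */
    public static ClassLoader coreLikeLoader() {
        return new ClassLoader(null) {};
    }

    public static void main(String[] args) {
        String name = DocxParser.class.getName();
        // The loader that defined DocxParser plays the role of the plugin loader.
        ClassLoader pluginLike = DocxParser.class.getClassLoader();

        System.out.println("core-like loader sees parser:   "
                + tryLoad(name, coreLikeLoader()).isPresent());
        System.out.println("plugin-like loader sees parser: "
                + tryLoad(name, pluginLike).isPresent());
    }
}
```

The one-argument Class.forName(name) resolves against the loader of the calling class, which is why a library living in the core loader fails to find plugin-only classes. Passing the plugin's loader explicitly (or a thread context class loader, where the library supports it) is one common workaround, though whether it fits Tika's code is an open question here.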
I'll go hack on this tonight and tomorrow and see where I get. I
think it's likely that Tika (or the libraries it depends on) will need
significant work on packaging and the like. I'm assuming that Felix
is the OSGi implementation you'd like to use by default?
No idea - I played briefly with both, the key being the word "played" ..
;) Equinox has fewer dependencies if I'm not mistaken?
I know somebody was fairly well along with this conversion 3-4 years
ago. Sami Siren is the name I associate with that. Anybody know
where all of that ended up? If nothing else, the boilerplate Ant
changes would be nice to have.
(http://wiki.apache.org/nutch/NutchOSGi)
How do you feel about build system modifications? It'd be much nicer
to use OSGi in a toolchain where dependency resolution was done for
us. I've looked at Ivy, but I couldn't seem to get it working. The
documentation and tutorials were just a bit terse, and I know how to
deal with Maven. I use Maven at my work all the time. When it works
it's glorious; when you hit a bug, it can be a showstopper.
However, I know for a lot of folks it is a non-starter.
I acknowledge that Maven may be superior to Ant at tracking dependencies
... let's leave it at that ;)
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com