Hi All,

TL;DR - There are a handful of Java mini-projects, one per file format,
each with a library and command-line tools, in and around Apache Tika.
Would Commons be a good Apache home for them?

Apache Tika, for those who don't know, is a toolkit for detecting file
types, then extracting consistent structured metadata and content. It
wraps a whole bunch of other Java libraries, and hides all the
complexity from users.

In a few cases, there hasn't been a suitably licensed / available
library for a format that Tika wanted to support, so we've ended up
having to write our own. As part of an experiment, some of these are in
the Tika codebase, and some are hosted externally. A few of them are
generally useful, in particular the Ogg and the MP3 ones.

For the formats where the support code is in Tika, we're not seeing any
re-use beyond Tika. The code is embedded in the Tika Parsers jar, and
no-one would think to look in there for some generic file format code.
Nor would you really expect to find it in Tika anyway, even if it had
its own jar. The Ogg code, which we've tried hosting on GitHub, has
seen some re-use. There hasn't been all that much visibility though,
and releasing without the Apache infrastructure can be a bit of a pain.
It also means one single person needs to take charge of the project.

For Ogg, as well as the Java library code, there's the Tika plugin code,
and command line tools. No audio encoding/decoding yet, but much of the
work is there if someone wanted to finish it off. We're also considering
adding a SAS7BDAT library to this little grouping shortly. As well as
Apache Tika, it would be used by Apache MetaModel and possibly some
others, and it would come with command line tools too.

Following some discussions last week at ApacheCon / Apache Big Data /
ApacheCon BarCamp on this, it was suggested we ask here whether you
think these could have a good home in Apache Commons. On the one hand,
they are in Java, and are re-usable. On the other, they have command
line tool packages as well, which don't seem that Commons-like, ditto
the multimedia encoding/decoding parts which are nearly there.

What do you all think? Could Commons be a suitable home for them? Or
should we look elsewhere? (We do have a backup idea if needed.)

Thanks
Nick