All,
I took a stab at the initial module structure based on Tim's and my email
[1]. If a package didn't seem to fit with anything else I created an
individual project for it. If any of the groupings don't make sense or
folks think there are better ways to organize I'm happy to move stuff
around. Patches are welcome :). I have created a JIRA issue [2].
Committed in rev 1723223.
There's still a good amount of outstanding work:
1) All this could use more testing, especially with the external parsers.
2) As Tim has already raised, there is the issue of maintaining two
branches in parallel. There are likely some fixes in trunk that are not
currently applied to the 2.0 branch.
3) The tika-parser project is currently using the maven shade plugin and
that is causing issues creating the OSGi MANIFEST.MF file. I should be
able to find a way around this.
4) Still need to recreate the OSGi uber jar with all dependencies
packaged with the tika code.
5) There are still some classes in the tika-parser project. Should
these all be moved to core? A common project?...
6) Documentation. I could use some Wiki access. Username: BobPaulin.
7) There are some dependencies in the tika-parser project that were not
needed to compile any of the individual modules or run tests. Are they
still needed?
8) Where does the
org.apache.tika.parser.external.CompositeExternalParser ServiceLoader
(META-INF/services/org.apache.tika.parser.Parser) config belong? I
moved it to tika-core since that is where the class lives.
9) Subcomponent licenses. I moved them to the modules they belong in
but I need to figure out a way to make them bubble up to the uber jars.
Or perhaps they need to be maintained in both places.
10) Anything I may be forgetting....;)
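
On item 8, a minimal sketch of why the location of the ServiceLoader
config matters: java.util.ServiceLoader only finds providers listed in a
META-INF/services/<service interface name> resource that is visible to
the class loader doing the lookup. DemoParser, DemoExternalParser and
ServiceConfigSketch are hypothetical stand-ins, not Tika classes, and
the temp directory stands in for whichever module jar ends up carrying
the file.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ServiceConfigSketch {

    /** Hypothetical stand-in for a service interface such as a parser API. */
    public interface DemoParser {
        String name();
    }

    /** Hypothetical provider: public, with a public no-arg constructor. */
    public static class DemoExternalParser implements DemoParser {
        public DemoExternalParser() {}

        @Override
        public String name() {
            return "demo-external";
        }
    }

    /**
     * Writes a provider-configuration file under root, then returns the
     * names of the providers that ServiceLoader discovers through it.
     */
    static List<String> discover(Path root) throws IOException {
        Path services = root.resolve("META-INF").resolve("services");
        Files.createDirectories(services);

        // File name = binary name of the service interface;
        // file content = binary name of each provider class, one per line.
        Files.write(services.resolve(DemoParser.class.getName()),
                DemoExternalParser.class.getName().getBytes(StandardCharsets.UTF_8));

        List<String> found = new ArrayList<>();
        URL[] classpath = { root.toUri().toURL() };
        // The directory plays the role of the jar that ships the config file.
        try (URLClassLoader loader =
                new URLClassLoader(classpath, DemoParser.class.getClassLoader())) {
            for (DemoParser parser : ServiceLoader.load(DemoParser.class, loader)) {
                found.add(parser.name());
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("service-config-sketch");
        System.out.println(discover(root));
    }
}
```

If the config file is missing from every jar on the classpath, the loop
simply sees no providers, so whichever module owns the file has to be
present wherever the parser is expected to be discovered.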
For the most part the changes just reorganize the existing
packages. There are a handful of changes to the test suite in order to
break some cyclical dependencies. Here's an overview of how the
projects interrelate at the moment:
tika-parser-modules
  - /tika-advanced-module
  - /tika-cad-module
      -> tika-text-module [test]
  - /tika-code-module
      -> tika-text-module [test]
  - /tika-database-module
      -> tika-office-module [test]
  - /tika-ebook-module
      -> tika-text-module
  - /tika-journal-module
      -> tika-pdf-module
  - /tika-multimedia-module
      -> tika-web-module [test]
      -> tika-office-module [test]
      -> tika-pdf-module [test]
  - /tika-office-module
      -> tika-web-module [test]
      -> tika-package-module [test]
      -> tika-text-module [test]
  - /tika-package-module
  - /tika-pdf-module
      -> tika-text-module [test]
      -> tika-package-module [test]
      -> tika-office-module [test]
  - /tika-scientific-module
      -> tika-text-module [test]
  - /tika-text-module
  - /tika-web-module
      -> tika-text-module [test]
      -> tika-package-module [test]
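
As a sanity check on the overview above, here is a rough sketch that
encodes those edges (compile and [test] scope alike) and walks them
depth-first to confirm there is no cycle left. The class and method
names are made up for illustration, and it assumes the build needs the
combined graph to be acyclic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ModuleGraphSketch {

    /** Edges copied from the overview; [test] and compile scope are merged. */
    static Map<String, List<String>> edges() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        g.put("tika-advanced-module", Collections.<String>emptyList());
        g.put("tika-cad-module", Arrays.asList("tika-text-module"));
        g.put("tika-code-module", Arrays.asList("tika-text-module"));
        g.put("tika-database-module", Arrays.asList("tika-office-module"));
        g.put("tika-ebook-module", Arrays.asList("tika-text-module"));
        g.put("tika-journal-module", Arrays.asList("tika-pdf-module"));
        g.put("tika-multimedia-module", Arrays.asList(
                "tika-web-module", "tika-office-module", "tika-pdf-module"));
        g.put("tika-office-module", Arrays.asList(
                "tika-web-module", "tika-package-module", "tika-text-module"));
        g.put("tika-package-module", Collections.<String>emptyList());
        g.put("tika-pdf-module", Arrays.asList(
                "tika-text-module", "tika-package-module", "tika-office-module"));
        g.put("tika-scientific-module", Arrays.asList("tika-text-module"));
        g.put("tika-text-module", Collections.<String>emptyList());
        g.put("tika-web-module", Arrays.asList(
                "tika-text-module", "tika-package-module"));
        return g;
    }

    /** Returns modules with dependencies first; throws if a cycle exists. */
    static List<String> buildOrder(Map<String, List<String>> graph) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String module : graph.keySet()) {
            visit(module, graph, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String module, Map<String, List<String>> graph,
            Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(module)) {
            return;
        }
        // Seeing a module that is still on the DFS stack means a cycle.
        if (!inProgress.add(module)) {
            throw new IllegalStateException("Cyclic dependency involving " + module);
        }
        for (String dep : graph.getOrDefault(module, Collections.<String>emptyList())) {
            visit(dep, graph, done, inProgress, order);
        }
        inProgress.remove(module);
        done.add(module);
        order.add(module);
    }

    public static void main(String[] args) {
        for (String module : buildOrder(edges())) {
            System.out.println(module);
        }
    }
}
```

Adding an edge back from, say, tika-text-module to tika-office-module
would make buildOrder throw, which is the kind of loop the test suite
changes were meant to remove.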
I'm very interested in feedback. We have been talking about this for a
while, but I'm sure actually seeing it will prompt more discussion. How
much simpler the individual pom files have become does seem to
demonstrate that this will be a good thing for the project.
Cheers,
- Bob
[1] http://mail-archives.apache.org/mod_mbox/tika-dev/201508.mbox/%3C55CF4C19.6050503%40bobpaulin.com%3E
[2] https://issues.apache.org/jira/browse/TIKA-1824