Hi Peter,

Thanks for your help and for a detailed explanation of what you did!
I, for one, would be super supportive if you had time to figure out a way to get it into Apache Any23. I'm sure the rest of the PPMC would be happy and willing to work with you to develop JIRA issues/patches, etc., to facilitate this.

Thank you again for your work!

Cheers,
Chris

On May 10, 2012, at 8:41 PM, Peter Ansell wrote:

> Hi all,
>
> Over the past two days I have split up Any23 into a variety of modules
> to make it easier to use different parts of the Any23 API. You can see
> the code at [1]. The current module list in the parent pom reactor
> looks like:
>
> <modules>
>   <module>api</module>
>   <module>csvutils</module>
>   <module>encoding</module>
>   <module>mime</module>
>   <module>core</module>
>   <module>test-resources</module>
>   <module>extractor</module>
>   <module>cli</module>
>   <module>test</module>
>   <module>service</module>
>   <module>plugins/basic-crawler</module>
>   <module>plugins/html-scraper</module>
>   <module>plugins/office-scraper</module>
>   <module>plugins/integration-test</module>
>   <module>sources-dist</module>
> </modules>
>
> None of the modules listed above core depend on core, and the core
> module depends only on the api module.
>
> The api module mostly contains interfaces but it also contains factory
> registries where they are fully Service Provider Interface (SPI)
> driven (Any23PluginManager and WriterFactoryRegistry which I created
> to alleviate the WriterRegistry hardcoded dependencies and
> reflection/annotation code that isn't easy to extend outside of the
> core library). The ExtractorRegistry was too difficult to convert to
> SPI just yet so I split it up into an interface and an implementation
> (ExtractorRegistryImpl) with the interface in the API module and used
> in some APIs where the singleton was previously used.
> These registries, together with Rio RDFFormat for referencing RDF format
> information, seemed to be enough to remove the hardcoding that I have
> been discussing at https://issues.apache.org/jira/browse/ANY23-83
>
> The changes fit my purposes as I can easily slot in the encoding and
> mime detection code without pulling in the core or extractor modules,
> and the supported types for the mime detection include any formats I
> register with OpenRDF Rio so it is extensible and modular for my
> purposes.
>
> However, most of the changes are too large for easy patching and I
> didn't arrange the changes into nice patches throughout as I was not
> sure what was going to happen in the end. I have submitted two very
> small patches to that issue, but there could be many more eventually
> if the redesigned code is acceptable.
>
> Note, I also removed the Any23 NQuads implementation as it was missing
> Factory implementations for the writer and parser classes so it wasn't
> being picked up by Rio.createParser or any of the other static Rio
> methods. I replaced it with the NQuads implementation from Sesametools
> which includes these factories and so is recognised. When
> http://www.openrdf.org/issues/browse/SES-802 gets implemented both of
> these implementations will likely be deprecated anyway so it wasn't a
> major issue for me. I would suggest in either case splitting out the
> NQuads classes into a separate module and implementing a Factory for
> both the parser and writer so they are picked up by SPI.
>
> There were some existing broken tests when I started, and there were a
> small number of tests that broke throughout, including one that broke
> when I updated to Tika-1.1. They are temporarily ignored, but can be
> found easily by checking the ignored tests when running the test
> suite.
>
> I hope the changes are useful to others.
>
> If you want to suggest changes to my version on GitHub feel free to
> open an issue or fork the repository and send a pull request back.
>
> Cheers,
>
> Peter
>
> [1] https://github.com/ansell/any23

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
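
For anyone following the thread who has not used SPI before, here is a minimal Java sketch of the kind of SPI-driven factory registry Peter describes. It is only an illustration built on the standard java.util.ServiceLoader: the names SpiRegistrySketch, FormatWriterFactory, FormatWriterRegistry and NQuadsWriterFactory are hypothetical and are not the Any23 or Rio APIs, and the real WriterFactoryRegistry may well be organised differently.

```java
import java.util.Map;
import java.util.Optional;
import java.util.ServiceLoader;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: every type in this file is hypothetical, not Any23 code.
final class SpiRegistrySketch {

    /** Hypothetical SPI contract that plugin modules would implement. */
    public interface FormatWriterFactory {
        String formatName();
        String mimeType();
    }

    /** Registry populated from the classpath rather than from a hardcoded list. */
    public static final class FormatWriterRegistry {
        private final Map<String, FormatWriterFactory> byName = new ConcurrentHashMap<>();

        public FormatWriterRegistry() {
            // ServiceLoader instantiates every implementation named in a
            // META-INF/services/<binary name of the interface> file on the
            // classpath. With no such file present, this loop does nothing.
            for (FormatWriterFactory factory : ServiceLoader.load(FormatWriterFactory.class)) {
                register(factory);
            }
        }

        /** Manual registration, handy for tests or non-SPI callers. */
        public void register(FormatWriterFactory factory) {
            byName.put(factory.formatName(), factory);
        }

        public Optional<FormatWriterFactory> find(String formatName) {
            return Optional.ofNullable(byName.get(formatName));
        }
    }

    /** Example provider that a separate module could ship. */
    public static final class NQuadsWriterFactory implements FormatWriterFactory {
        @Override
        public String formatName() {
            return "nquads";
        }

        @Override
        public String mimeType() {
            return "application/n-quads";
        }
    }

    public static void main(String[] args) {
        FormatWriterRegistry registry = new FormatWriterRegistry();
        // Registered by hand here because this single file ships no services file.
        registry.register(new NQuadsWriterFactory());
        System.out.println(
            registry.find("nquads").map(FormatWriterFactory::mimeType).orElse("unknown"));
    }
}
```

In a multi-module build, the provider class and its META-INF/services entry would live in their own module, so the registry module needs no compile-time dependency on it. That is the same property that lets the api module avoid depending on core.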
