On 11 May 2012 19:55, Michele Mostarda <[email protected]> wrote:
> Hi Peter,
>
> I had a really quick look at your contribution, thanks for your effort.
>
> What I suggest is to provide your modifications as (possibly small)
> patches (as you've already done).

I will look into the best way to gradually do that but it will likely wait until after 0.7.0 gets out the door.

> On 11 May 2012 08:41, Peter Ansell <[email protected]> wrote:
>
>> Hi all,
>>
>> Over the past two days I have split up Any23 into a variety of modules
>> to make it easier to use different parts of the Any23 API. You can see
>> the code at [1]. The current module list in the parent pom reactor
>> looks like:
>>
>> <modules>
>>   <module>api</module>
>>   <module>csvutils</module>
>>   <module>encoding</module>
>>   <module>mime</module>
>>   <module>core</module>
>>   <module>test-resources</module>
>>   <module>extractor</module>
>>   <module>cli</module>
>>   <module>test</module>
>>   <module>service</module>
>>   <module>plugins/basic-crawler</module>
>>   <module>plugins/html-scraper</module>
>>   <module>plugins/office-scraper</module>
>>   <module>plugins/integration-test</module>
>>   <module>sources-dist</module>
>> </modules>
>>
>
> The modularization refactoring at this stage introduces some complexity and
> must be discussed with the community in this mailing list, in particular
> with the Release Manager (which has to deal with all these modules :) ).

The main component that doesn't seem to have a place right now is the single utility class in csvutils. The functionality scope for the other modules seems fairly clean, except that the core module may still be able to be split up further into cleaner modules based on functionality.

>
>> All of the modules above core do not have dependencies on core, and
>> the core module only has a dependency on the api module.
>>
>> The api module mostly contains interfaces but it also contains factory
>> registries where they are fully Service Provider Interface (SPI)
>> driven (Any23PluginManager and WriterFactoryRegistry which I created
>> to alleviate the WriterRegistry hardcoding dependencies and
>> reflection/annotation code that isn't easy to extend outside of the
>> core library).
>> The ExtractorRegistry was too difficult to convert to
>> SPI just yet so I split it up into an interface and an implementation
>> (ExtractorRegistryImpl) with the interface in the API module and used
>> in some APIs where the singleton was previously used. These
>> registries, together with Rio RDFFormat for referencing RDF format
>> information, seemed to be enough to remove the hardcoding that I have
>> been discussing at https://issues.apache.org/jira/browse/ANY23-83
>
> That's really good.
>
>>
>> The changes fit my purposes as I can easily slot in the encoding and
>> mime detection code without pulling in the core or extractor modules,
>> and the supported types for the mime detection include any formats I
>> register with OpenRDF Rio so it is extensible and modular for my
>> purposes.
>>
>> However, most of the changes are too large for easy patching and I
>> didn't arrange the changes into nice patches throughout as I was not
>> sure what was going to happen in the end. I have submitted two very
>> small patches to that issue, but there could be many more eventually
>> if the redesigned code is acceptable.
>>
>
> I understand, but it is difficult and time-consuming for us to pull
> modifications from an external repository.

Most of the changes will be destined for 0.8.x, as I understand that the 0.7.0 release is necessary to start getting people using the current code.

>> Note, I also removed the Any23 NQuads implementation as it was missing
>> Factory implementations for the writer and parser classes so it wasn't
>> being picked up by Rio.createParser or any of the other static Rio
>> methods. I replaced it with the NQuads implementation from Sesametools
>> which includes these factories and so is recognised. When
>> http://www.openrdf.org/issues/browse/SES-802 gets implemented both of
>> these implementations will likely be deprecated anyway so it wasn't a
>> major issue for me.
>> I would suggest in either case splitting out the
>> NQuads classes into a separate module and implementing a Factory for
>> both the parser and writer so they are picked up by SPI.
>>
>
> That would be fine.
>
>>
>> There were some existing broken tests when I started, and there were a
>> small number of tests that broke throughout, including one that broke
>> when I updated to Tika-1.1. They are temporarily ignored, but can be
>> found easily by checking the ignored tests when running the test
>> suite.
>>
>
> This is bad, the MIMEType detection is really central for the use cases
> covered by the Any23 main users.

I did not understand the cause of the breakage. It appeared in an extractor unit test I think, so I wouldn't have identified it as a Tika upgrade bug except that it was the single test that newly failed after the version bump.

>> I hope the changes are useful to others.
>>
>
> I think so, it would be nice to have you more involved within the group
> discussions.
>
>>
>> If you want to suggest changes to my version on GitHub feel free to
>> open an issue or fork the repository and send a pull request back.
>>
>
> Sure.
>
>>
>> Cheers,
>>
>> Peter
>>
>
> Thanks a lot!
> The best.
>
> Michele
>
>>
>> [1] https://github.com/ansell/any23
>>
>
> --
> Michele Mostarda
> Senior Software Engineer
> skype: michele.mostarda
> twitter: micmos
> mail: [email protected]
> site : http://www.michelemostarda.com
