Hi Peter,

Thanks for the explanation and coverage. I think we should phase in this issue as a single entity. As you mention, it does not get more complex with a modular restructuring. It is also important to get up to speed with the Tika deps, as we are currently way behind.

On Wed, Aug 8, 2012 at 3:10 AM, Peter Ansell <[email protected]> wrote:
> Hi Lewis,
>
> It is a while since I did the update to Tika-1.1, but the upgrade
> would be very easy to do independent of any module reorganisation.
>
> The major component involved updating mimetypes.xml and
> tika-config.xml based on the resources extracted from the tika 1.1 jar
> file.
> https://github.com/ansell/any23/tree/ansellpatches/mime/src/main/resources/org/apache/any23/mime
>
> I also modified the default mime-type to match the current drafts for
> each of the standards and added the previous mime types as aliases, as
> Any23 has so far been using non-standard mime-types.
> https://github.com/ansell/any23/commit/8d3162c6510fa76aad0316e9e8be5ea66ee0fe7c
>
> Some of the test failures that I encountered were due to the addition
> of license headers to the test files just before I started making my
> changes. The license headers had periods inside comments that
> incorrectly signalled the end of a statement to the mime detector
> regexes. This was picked up since then and the license headers were
> removed, but I think the mime type detection code still has a bug if
> people put comments in the top of RDF NQuads or RDF NTriples files, as
> it still relies on the period as a context-less delimiter.
> https://github.com/ansell/any23/blob/trunk/core/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java#L96
>
> In terms of the actual detector, I ended up switching off the regex
> pattern recognition and switching to an alternative method based on
> more complex character based boundaries to extract a sample, which was
> then parsed and if the parse succeeded then it was recognised as that
> mime type. However, this may not be the best way to do it, although it
> works for me so far. This change is the main part that needs review.
> https://github.com/ansell/any23/blob/ansellpatches/mime/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java
>
> Peter
>
> On 7 August 2012 21:34, Lewis John Mcgibbney <[email protected]>
> wrote:
>> Hi Peter,
>>
>> Firstly, thanks for the formal introduction; glad that you're now
>> officially on board.
>>
>> I've changed the thread topic slightly to discuss what work you have
>> done on your github branch regarding the Tika upgrade. I see that you're
>> using Tika 1.1? Would it be possible to phase this into the existing
>> codebase before doing the module restructuring that we are currently
>> discussing elsewhere?
>>
>> I vaguely remember you saying that there were some problems with tests
>> or something (further to the Tika dependency upgrade) but I cannot
>> confirm this just now and it would be great if you could refresh my
>> memory.
>>
>> If we could review (with the intention to merge back into trunk) some
>> of your work more incrementally then I think we can phase it in
>> quicker... does this make sense?
>>
>> Thanks very much
>> Lewis
>>
>> On Tue, Aug 7, 2012 at 1:09 AM, Peter Ansell <[email protected]> wrote:
>>> Hi all,
>>>
>>> I am a software engineer with a PhD in Computer Science. I have worked
>>> on a number of RDF related projects since the start of my PhD, mainly
>>> using Sesame, including also integrating Sesame with OWLAPI [1] over
>>> the last few months to suit my current project's needs.
>>>
>>> I am looking in the short term to restructure the Maven modules inside
>>> of Any23 so that the different facets can be reused, tested and
>>> maintained easily, particularly with a view to using the RDF related
>>> Tika enhancements that the Any23 MIME Detector provides. I made these
>>> changes a few months ago in my GitHub fork [2], so feel free to review
>>> them closely to suggest enhancements before I actually start. I am not
>>> sure when I will next have time to clean up the patches.
>>> The first step that I want to take is to split out the test resources
>>> into a single module and switch from "src/test/resources/*" File based
>>> access in tests to using this.getClass().getResourceAsStream("*"). I have
>>> implemented those changes in my git repository but the patches may
>>> need cleaning up as I have not gone back to review them yet. After
>>> that is done, it will be relatively simple to split out both the
>>> packages and tests into separate modules.
>>>
>>> In the short term I have also been tasked by the Sesame Developers
>>> with merging the Any23 and Sesametools NQuads parsers and integrating
>>> the resulting module into the Sesame Rio package. Then we can have a
>>> rock-solid, standards-based, NQuads parser/writer that everyone can
>>> easily reuse in a similar way to the other Rio parsers/writers. This
>>> is the culmination of the http://www.openrdf.org/issues/browse/SES-802
>>> issue that Michele opened over a year ago.
>>>
>>> Cheers,
>>>
>>> Peter
>>>
>>> [1] https://github.com/ansell/owlapi
>>> [2] https://github.com/ansell/any23
>>>
>>> On 4 August 2012 12:25, Mattmann, Chris A (388J)
>>> <[email protected]> wrote:
>>>> Hi Folks,
>>>>
>>>> A while back, the Any23 PPMC and the Incubator PMC VOTEd to add Peter
>>>> Ansell
>>>> to our ranks as a PPMC member and committer. Peter, welcome!
>>>>
>>>> Feel free to say a bit about yourself!
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: [email protected]
>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>
>>
>>
>> --
>> Lewis

--
Lewis
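
On the comment handling Peter raises in this thread (periods inside "#" comments being read as statement terminators in NTriples/NQuads files), here is a minimal sketch of one way to make a sniffing step comment-aware. It is not the code in TikaMIMETypeDetector and not Peter's parse-a-sample approach; it is a simplified, stdlib-only stand-in. The class name NTriplesSniffer, the method looksLikeNTriples and the pattern are all hypothetical, and the pattern is deliberately loose.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Any23: comment-aware sniffing of NTriples-like text. */
final class NTriplesSniffer {

    // Rough shape of one statement: subject, predicate, object, closing period.
    // Assumption: this loose pattern is enough for sniffing; a real detector
    // would hand the sample to an RDF parser instead.
    private static final Pattern STATEMENT = Pattern.compile(
            "^(<[^>]*>|_:\\S+)\\s+<[^>]*>\\s+(<[^>]*>|_:\\S+|\".*\"\\S*)\\s*\\.$");

    /** Returns true if the sample has at least one statement and no non-statement lines. */
    static boolean looksLikeNTriples(String sample) {
        boolean sawStatement = false;
        for (String rawLine : sample.split("\\r?\\n")) {
            String line = rawLine.trim();
            // Skip blank lines and full-line comments, so a period inside a
            // license header is never treated as a statement terminator.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            if (!STATEMENT.matcher(line).matches()) {
                return false;
            }
            sawStatement = true;
        }
        return sawStatement;
    }

    public static void main(String[] args) {
        String sample = "# Licensed under the Apache License, Version 2.0.\n"
                + "<http://example.org/s> <http://example.org/p> \"o\" .\n";
        System.out.println(looksLikeNTriples(sample));
    }
}
```

The sketch only handles full-line comments; a comment trailing a statement on the same line would still be rejected, which is one reason parsing an extracted sample, as Peter describes, may be the more robust route.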

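
For the switch from "src/test/resources/*" File based access to getResourceAsStream that Peter mentions, the idea can be sketched roughly as below. The class name ResourceLoadingSketch and the method open are hypothetical and not taken from his branch; the only API relied on is Class.getResourceAsStream, which returns null when the resource is not on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of Any23: load test data from the classpath. */
final class ResourceLoadingSketch {

    /**
     * Opens a classpath resource such as "/some/test-file.nq" (illustrative name).
     * Unlike new File("src/test/resources/..."), this does not depend on the
     * working directory, so it still works once the resources live in their own
     * module and are packaged in a jar.
     */
    static InputStream open(String name) throws IOException {
        InputStream in = ResourceLoadingSketch.class.getResourceAsStream(name);
        if (in == null) {
            // getResourceAsStream signals a missing resource with null, so fail loudly here.
            throw new IOException("Resource not found on classpath: " + name);
        }
        return in;
    }
}
```

A leading "/" makes the name absolute on the classpath; without it the name is resolved relative to the package of the class.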