Hi Lewis,
I have a patch on GitHub to upgrade to tika-1.2 that is currently
sitting on top of my test-resource-jar branch [1], although there
should not be any conflicts even if it is applied before that patch.
There is one test that is failing currently, and it is really quite
puzzling to me as a person who knows nothing about microformats. (Note
I fixed the error message below as it previously said "The model is
expected to be empty", which was temporarily confusing).
java.lang.AssertionError: The model is expected to not be empty.Assertion failed! Extracted triples:
    at org.apache.any23.extractor.html.AbstractExtractorTestCase.assertModelNotEmpty(AbstractExtractorTestCase.java:281)
    at org.apache.any23.extractor.html.HCardExtractorTest.assertDefaultVCard(HCardExtractorTest.java:981)
    at org.apache.any23.extractor.html.HCardExtractorTest.testObjectDataDataUri(HCardExtractorTest.java:747)
What is strange about this failure is that the very similar test
HCardExtractorTest.testObjectDataHttpUri(), which looks like it
should produce fewer triples, does produce some triples.
I have included the old any23 mimetypes.xml file along with both
the original tika-1.2 mimetypes.xml file and a patched version that makes
virtually all of the any23 tests work. The choice of which
mimetypes.xml file to use is defined in the
core/src/main/resources/org/apache/any23/mime/tika-config.xml file.
Cheers,
Peter
[1] https://github.com/ansell/any23/compare/test-resource-jar...tika-12
On 8 August 2012 19:44, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Peter,
> Thanks for the explanation and coverage.
> I think we should phase in this issue as a single entity. As you
> mention, it does not get more complex with a modular restructuring.
> It is also important to get up to speed with the Tika deps, as we are
> currently way behind.
>
> On Wed, Aug 8, 2012 at 3:10 AM, Peter Ansell <[email protected]> wrote:
>> Hi Lewis,
>>
>> It has been a while since I did the update to Tika-1.1, but the upgrade
>> would be very easy to do independently of any module reorganisation.
>>
>> The major component involved updating mimetypes.xml and
>> tika-config.xml based on the resources extracted from the tika 1.1 jar
>> file.
>> https://github.com/ansell/any23/tree/ansellpatches/mime/src/main/resources/org/apache/any23/mime
>>
>> I also modified the default mime-type to match the current drafts for
>> each of the standards and added the previous mime types as aliases, as
>> Any23 has so far been using non-standard mime-types
>> https://github.com/ansell/any23/commit/8d3162c6510fa76aad0316e9e8be5ea66ee0fe7c
>>
>> Some of the test failures that I encountered were due to the addition
>> of license headers to the test files just before I started making my
>> changes. The license headers had periods inside comments that
>> incorrectly signalled the end of a statement to the mime detector
>> regexes. This has since been picked up and the license headers were
>> removed, but I think the mime type detection code still has a bug if
>> people put comments at the top of RDF NQuads or RDF NTriples files, as
>> it still relies on the period as a context-less delimiter.
>> https://github.com/ansell/any23/blob/trunk/core/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java#L96
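A minimal sketch of the failure mode described above. This is not the actual TikaMIMETypeDetector code: the class name, the regex, and the helper methods are all hypothetical, chosen only to show how a period inside a leading comment can mislead a detector that treats '.' as a delimiter without looking at context, and how skipping '#' comment lines before sampling avoids it.

```java
import java.util.regex.Pattern;

class PeriodDelimiterSketch {

    // Hypothetical pattern for one N-Triples-like statement, without its final period.
    private static final Pattern TRIPLE =
            Pattern.compile("\\s*<[^>]*>\\s+<[^>]*>\\s+(<[^>]*>|\"[^\"]*\")\\s*");

    // Returns the text before the first period that is not inside <...>.
    // Periods inside quoted literals are not handled; this is only a sketch.
    static String firstStatement(String sample) {
        boolean insideIri = false;
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            if (c == '<') {
                insideIri = true;
            } else if (c == '>') {
                insideIri = false;
            } else if (c == '.' && !insideIri) {
                return sample.substring(0, i);
            }
        }
        return sample;
    }

    // Naive check: any period outside <...> ends the statement, even in a comment.
    static boolean naiveCheck(String sample) {
        return TRIPLE.matcher(firstStatement(sample)).matches();
    }

    // Comment-aware check: drop lines starting with '#' before sampling.
    static boolean commentAwareCheck(String sample) {
        StringBuilder withoutComments = new StringBuilder();
        for (String line : sample.split("\n")) {
            if (!line.trim().startsWith("#")) {
                withoutComments.append(line).append('\n');
            }
        }
        return naiveCheck(withoutComments.toString());
    }

    public static void main(String[] args) {
        String plain =
                "<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n";
        String withHeader = "# Licensed under the Apache License, Version 2.0.\n" + plain;

        System.out.println(naiveCheck(plain));             // true
        System.out.println(naiveCheck(withHeader));        // false: the "2.0" period ends the sample
        System.out.println(commentAwareCheck(withHeader)); // true
    }
}
```

The comment-stripping step is just one possible fix, assumed here for illustration; a real detector would also need to cope with periods inside literals.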
>>
>> In terms of the actual detector, I ended up switching off the regex
>> pattern recognition and switching to an alternative method that uses
>> more complex character-based boundaries to extract a sample. The sample
>> is then parsed, and if the parse succeeds the content is recognised as
>> that mime type. This may not be the best way to do it, although it
>> works for me so far. This change is the main part that needs review.
>> https://github.com/ansell/any23/blob/ansellpatches/mime/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java
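The parse-a-sample idea described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code on the ansellpatches branch: the SampleParser interface, the class name, and the toy mime types are hypothetical, the real RDF parsers are replaced by trivial stand-ins, and the boundary-based sample extraction step is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

class ParseBasedDetectionSketch {

    // Hypothetical stand-in for a real parser: throws if the sample is not valid.
    interface SampleParser {
        void parse(String sample) throws Exception;
    }

    // Candidates are tried in registration order; the first successful parse wins.
    private final Map<String, SampleParser> candidates = new LinkedHashMap<>();

    void register(String mimeType, SampleParser parser) {
        candidates.put(mimeType, parser);
    }

    // Returns the mime type of the first parser that accepts the sample, if any.
    Optional<String> detect(String sample) {
        for (Map.Entry<String, SampleParser> candidate : candidates.entrySet()) {
            try {
                candidate.getValue().parse(sample);
                return Optional.of(candidate.getKey());
            } catch (Exception e) {
                // Parse failed, so the sample is not this type; try the next candidate.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ParseBasedDetectionSketch detector = new ParseBasedDetectionSketch();
        // Toy "parsers" for illustration only.
        detector.register("application/x-toy-integer", s -> Integer.parseInt(s.trim()));
        detector.register("application/x-toy-nonblank", s -> {
            if (s.trim().isEmpty()) {
                throw new IllegalArgumentException("blank sample");
            }
        });

        System.out.println(detector.detect("42"));    // Optional[application/x-toy-integer]
        System.out.println(detector.detect("hello")); // Optional[application/x-toy-nonblank]
        System.out.println(detector.detect("   "));   // Optional.empty
    }
}
```

Registration order matters in this sketch: a permissive parser registered early would shadow stricter ones, so stricter formats would presumably need to be tried first.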
>>
>> Peter
>>
>> On 7 August 2012 21:34, Lewis John Mcgibbney <[email protected]> wrote:
>>> Hi Peter,
>>>
>>> Firstly, thanks for the formal introduction; glad that you're now
>>> officially on board.
>>>
>>> I've changed the thread topic slightly to discuss the work you have
>>> done on your github branch regarding the Tika upgrade. I see that you're
>>> using Tika 1.1. Would it be possible to phase this into the existing
>>> codebase before doing the module restructuring that we are currently
>>> discussing elsewhere?
>>>
>>> I vaguely remember you saying that there were some problems with tests
>>> or something (further to the Tika dependency upgrade), but I cannot
>>> confirm this just now and it would be great if you could refresh my
>>> memory.
>>>
>>> If we could review (with the intention to merge back into trunk) some
>>> of your work more incrementally, then I think we can phase it in
>>> more quickly... does this make sense?
>>>
>>> Thanks very much
>>> Lewis