Hi Markus,

I've been loosely keeping up with your and Julien's work on this.
I would really like to get a test case for this, though! I'll try working towards it as a sub-target of another issue. For reference, there is a Tika MIME type test case here [1] and a Tika document encoding test here [2], which we may or may not be interested in porting over to o.a.n. wdyt?

Thanks
Lewis

[1] https://svn.apache.org/viewvc/incubator/any23/trunk/core/src/test/java/org/apache/any23/mime/TikaMIMETypeDetectorTest.java?view=markup
[2] https://svn.apache.org/viewvc/incubator/any23/trunk/core/src/test/java/org/apache/any23/encoding/TikaEncodingDetectorTest.java?view=markup

On Tue, Feb 14, 2012 at 11:51 PM, Markus Jelsma <[email protected]> wrote:

> Hi,
>
> This was indeed an issue until today. The detected type is in the crawl datum
> metadata.
>
> https://issues.apache.org/jira/browse/NUTCH-1259
>
> > Hi,
> >
> > I can't see anywhere within our parser plugins where we detect encoding of
> > documents. I've also begun looking through the o.a.n.p package but again I
> > can't see anything.
> >
> > Can anyone provide some detail on this please?
> >
> > Thank you
> >
> > Lewis

--
*Lewis*

