Re: Detecting Encoding with plugins

Julien Nioche Wed, 15 Feb 2012 05:07:51 -0800

Hi Lewis

>> I assume Tika does already - why should we duplicate the tests in Nutch?
>
> We don't want to I suppose. However the point I was trying to make was
> that as NUTCH-1259 detects the encoding type,
> however we don't have an automated test to cover this, I assume the case is
> somewhat important or else the ticket for NUTCH-1259 wouldn't have been
> opened originally?
>


Nope. NUTCH-1259 is about storing the mime-type value detected by Tika,
which is not the same thing as the encoding. This specific JIRA is not about
whether or not we get the correct value; it is a purely functional one about
where we store it. There is not much to test wrt it.
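
To make the mime-type vs. encoding distinction concrete, here is a minimal
sketch in plain Java. It is not Nutch or Tika code: the class and method
names are hypothetical, and it only splits a Content-Type style string to
show that the two are separate values which can be stored separately.

```java
import java.util.Locale;

public class ContentTypeParts {
    // Hypothetical helper: returns the media type part, e.g. "text/html".
    static String mimeType(String contentType) {
        int semi = contentType.indexOf(';');
        String type = semi < 0 ? contentType : contentType.substring(0, semi);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: returns the charset parameter, or null if absent.
    static String charset(String contentType) {
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String header = "text/html; charset=ISO-8859-1";
        System.out.println("mime-type: " + mimeType(header));
        System.out.println("charset:   " + charset(header));
    }
}
```

A document can have a perfectly well detected mime-type and no declared
charset at all, which is why the two are handled independently.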



> I agree with you that general cases should be dealt with further upstream
> within Tika development itself, however as the encoding detection is done
> in Nutch within the cd metadata we may wish to get some test case to
> check... it's not a huge thing I suppose.
>

We do have tests for the EncodingDetector (TestEncodingDetector), which is
already used by parse-html. It is OK to have that, as it is our own parser.
As explained earlier, for the Tika parser the detection is delegated to the
Tika parser implementations and as such should be tested there.
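
For illustration only, the general shape of such a check is: build bytes in
a known encoding, run the detector, compare. The sketch below is NOT the
actual TestEncodingDetector and does not use Nutch's EncodingDetector; the
detector is a hypothetical stand-in that only looks at a leading byte-order
mark (and ignores UTF-32, whose little-endian BOM starts with the same two
bytes as UTF-16LE).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffSketch {
    // Hypothetical stand-in detector: inspects only a leading byte-order mark
    // and returns the fallback charset when none is found.
    static Charset detect(byte[] data, Charset fallback) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    // Builds a sample: the given BOM bytes followed by the encoded text.
    static byte[] sample(int[] bom, String text, Charset cs) {
        byte[] body = text.getBytes(cs);
        byte[] out = new byte[bom.length + body.length];
        for (int i = 0; i < bom.length; i++) {
            out[i] = (byte) bom[i];
        }
        System.arraycopy(body, 0, out, bom.length, body.length);
        return out;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        byte[] utf8 = sample(new int[] {0xEF, 0xBB, 0xBF}, "caf\u00e9", StandardCharsets.UTF_8);
        byte[] plain = "cafe".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(utf8, fallback));
        System.out.println(detect(plain, fallback));
    }
}
```

A real detector combines several clues and heuristics, so tests of this kind
belong with whichever project owns the detection logic.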


>
>> we delegate the functionality to Tika, IMHO this means delegating the
>> testing as well. What we could do is contribute tests to Tika instead if it
>> does not have any.
>>
> Yeah this is correct. I'm expecting you guys will know better than me but
> I would assume that Tika is mimetype and encoding detection compliant ;0)
>

I definitely do not pretend to know more than anyone else BTW :-) I don't
understand what you mean by 'compliant'. Perfect? Probably not. There was
an interesting experiment made by Ken on measuring the accuracy of the
charset detection in the Tika book - which anyone remotely interested in
Nutch should get BTW. There was also an interesting blog entry recently
comparing the language detection in Tika and other libraries (can't find the
ref and am in a hurry - sorry).


>
>
>> Re-any23 : why not handle it as a Tika parser instead of a Nutch one?
>> This could be useful to other Tika users who do not necessarily use Nutch
>>
> OK so I suppose this is completely open for discussion and I really
> welcome it as well. On one hand I see working with Any23 as a parse-any23
> plugin within Nutch as the first step in the road to answering this
> question. Regardless of whether Any23 graduates and is integrated into Tika
> itself or as a TLP you are completely right that it should be made
> openly available to as many people as possible. Personally I agree with
> you Julien.
>
> One last thing, I know this is off topic... but with regards to our
> microformats-reltag plugin... I think the RelTagParser could and should be
> moved over to Any23. Any23 already supports extraction of a number of
> microformats. wdyt?
>

It would probably make sense as an initial step if you don't want to
venture into trying to wrap it as a Tika parser :-)

Julien



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
