Re: Detecting Encoding with plugins

Julien Nioche Wed, 15 Feb 2012 04:28:03 -0800

I assume Tika tests this already, so why should we duplicate the tests in
Nutch? We delegate the functionality to Tika; IMHO this means delegating the
testing as well. What we could do is contribute tests to Tika instead if it
does not have any.


Re any23: why not handle it as a Tika parser instead of a Nutch one?
This could be useful to other Tika users who do not necessarily use Nutch.
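
To make the point quoted further down concrete (the mimetype is not the same
thing as the encoding), here is a minimal sketch using only the Java standard
library. It is not how Nutch or Tika do it; the class and method names
(ContentTypeSketch, mimeType, charset) are made up for illustration, and it
only reads a declared Content-Type value rather than sniffing the bytes the
way a parser would.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Nutch or Tika: it only splits a
// Content-Type value into its two independent pieces.
public class ContentTypeSketch {

    // Returns the media type part (e.g. "text/html"), lower-cased.
    public static String mimeType(String contentType) {
        int semi = contentType.indexOf(';');
        String type = semi < 0 ? contentType : contentType.substring(0, semi);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the declared charset, or the fallback when no charset
    // parameter is present or its name is invalid/unsupported.
    public static Charset charset(String contentType, Charset fallback) {
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length())
                                   .replace("\"", "")
                                   .trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String header = "text/html; charset=ISO-8859-1";
        System.out.println("type:     " + mimeType(header));
        System.out.println("encoding: " + charset(header, StandardCharsets.UTF_8));
    }
}
```

Knowing the type says nothing about the encoding: a bare "text/html" value
falls through to the fallback here, which is the case where a parser has to
detect the encoding from the content itself.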

Julien

On 15 February 2012 12:17, Lewis John Mcgibbney
<[email protected]> wrote:

> Yes this is correct, but we still don't test for either of the two.
>
>
> On Wed, Feb 15, 2012 at 10:59 AM, Julien Nioche <
> [email protected]> wrote:
>
>> The mimetype is not the same thing as the encoding. As Ken pointed out
>> this is done at the individual parser level
>>
>>
>> On 14 February 2012 23:51, Markus Jelsma <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> This was indeed an issue until today. The detected type is in the crawl
>>> datum metadata.
>>>
>>> https://issues.apache.org/jira/browse/NUTCH-1259
>>>
>>> > Hi,
>>> >
>>> > I can't see anywhere within our parser plugins where we detect encoding of
>>> > documents. I've also begun looking through the o.a.n.p package but again I
>>> > can't see anything.
>>> >
>>> > Can anyone provide some detail on this please?
>>> >
>>> > Thank you
>>> >
>>> > Lewis
>>>
>>
>>
>>
>> --
>> *
>> *Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>>
>
>
> --
> *Lewis*
>
>


-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
