Hi Tom,

On Sun, Sep 7, 2014 at 1:28 PM, <[email protected]> wrote:

>
> now when parsing HTML files these days Tika adds the charset attribute to
> the string.
>

Is this behaviour consistent with other MimeTypes?


>
> I would have thought the normalize call was designed to remove this
> because tika-mimetypes.xml surely isn't supposed to contain charset
> matching tags?
>

That was also my own understanding.


>
> Anyway if you do
>
> Tika.detect(myurl)
>

This is a tricky one for me. If at all possible, I always try not to use
single-argument methods within the Tika interface. I've found that the
detection algorithms work best when more information is provided as
parameters, e.g. using a Metadata object which has been populated with more
comprehensive values. I know that this doesn't help you much, but if it is
at all possible then I would suggest you try the same.


>
> followed by
>
> MimeTypes.getRegisteredMimeType("text/html; charset=UTF-8");
>
> It returns null because it doesn't strip the charset, without it its fine.
>
> Bug/Feature/Misunderstanding?
>

I would say the former; however, I can't reproduce this right now.
Can you possibly provide some output from the method calls so that we can
see exactly what is returned?
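
In the meantime, if the detected string really does carry a charset
parameter, one possible workaround is to strip the parameters yourself before
the lookup. Below is a minimal sketch using only the JDK. The class
MediaTypeStrings and the helper baseType are hypothetical names of mine, not
part of Tika, and the sketch assumes that everything after the first ';' is a
parameter:

```java
import java.util.Locale;

public class MediaTypeStrings {

    /**
     * Hypothetical helper (not a Tika API): returns the bare "type/subtype"
     * part of a content type string, dropping parameters such as
     * "; charset=UTF-8".
     */
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Assumption: everything after the first ';' is a parameter.
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        // Normalise whitespace and case so lookups compare like with like.
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The string from the report above, reduced to its base type.
        System.out.println(baseType("text/html; charset=UTF-8"));
    }
}
```

You would then pass the result of baseType, rather than the raw detected
string, to getRegisteredMimeType.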
Thanks
Lewis
