Hey guys

I was working with MimeTypes.getRegisteredMimeType, and within that method it calls

registry.normalize(type)

Nowadays, when parsing HTML files, Tika adds the charset parameter to the type string.

I would have thought the normalize call was designed to remove this, because tika-mimetypes.xml surely isn't supposed to contain entries that match on charset?

Anyway, if you call

Tika.detect(myurl)

followed by

MimeTypes.getRegisteredMimeType("text/html; charset=UTF-8");

It returns null because the charset isn't stripped; without the charset, the lookup works fine.
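
One possible workaround, until it's clear whether this is intended, would be to strip the parameters from the string before doing the lookup. The following is only a rough sketch in plain Java: the class MediaTypeStrings and the method stripParameters are hypothetical names of my own, not part of Tika, and it assumes the first semicolon marks the start of the parameters.

```java
import java.util.Locale;

// Hypothetical helper, NOT part of Tika: reduces a media type string
// such as "text/html; charset=UTF-8" to its base type "text/html".
final class MediaTypeStrings {
    private MediaTypeStrings() {}

    static String stripParameters(String type) {
        if (type == null) {
            return null;
        }
        // Assumption: everything from the first ';' onwards is parameters.
        int semicolon = type.indexOf(';');
        String base = semicolon >= 0 ? type.substring(0, semicolon) : type;
        // Media type names are case-insensitive, so lower-case the result.
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Prints "text/html"
        System.out.println(stripParameters("text/html; charset=UTF-8"));
    }
}
```

The stripped string could then be passed to MimeTypes.getRegisteredMimeType in place of the full detected value.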

Bug/Feature/Misunderstanding?

Regards

Tom
--
*Tom Barber* | Technical Director

meteorite bi
*T:* +44 20 8133 3730
*W:* www.meteorite.bi | *Skype:* meteorite.consulting
*A:* Surrey Technology Centre, Surrey Research Park, Guildford, GU2 7YG, UK
