Greetings everyone! I have two pull requests related to the use of Tika for web content that have been waiting for quite some time now:
- [Improving html charset detection](https://github.com/apache/tika/pull/242): None of the current charset detectors in Tika follow the web standards, and in my tests, around 15% of web pages were misdetected by the default charset detector. This pull request implements a new charset detector for web pages, with better accuracy.
- [fixing mime-type detection over http](https://github.com/apache/tika/pull/236): Currently, Tika has no knowledge of server-side interpreted languages such as PHP. Thus, given a URL like "http://example.com/index.php", it tends to guess that the MIME type is "text/x-php", whereas this is in fact very unlikely. This PR teaches Tika which extensions are associated with server-side interpreted languages.

If someone could have a look at these pull requests, and maybe include them in the next release, that would help us a lot! I am of course still open to discussion and ready to update the code if changes need to be made.

Cheers,
G. Bouchar
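
P.S. To make the second point concrete, here is a minimal, standalone sketch of the idea. It is not the code from the pull request and does not use any Tika API. The class name, the method names, and the list of extensions are all hypothetical, chosen only for illustration. It shows how a file name hint could be ignored when a resource is fetched over HTTP and its extension belongs to a server-side interpreted language.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: names and the extension list are assumptions,
// not taken from Tika or from the pull request.
class ServerSideExtensionSketch {

    // Hypothetical list of extensions that usually denote server-side scripts.
    private static final Set<String> SERVER_SIDE_EXTENSIONS =
            Set.of("php", "asp", "aspx", "jsp", "cgi");

    /** Returns the lower-cased extension of the URL path, or "" if there is none. */
    static String extensionOf(URI uri) {
        String path = uri.getPath();
        if (path == null) {
            return "";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // No dot in the last path segment, or the path ends with a dot.
        if (dot <= slash || dot == path.length() - 1) {
            return "";
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True when the file name should not be used as a hint for the MIME type. */
    static boolean shouldIgnoreNameHint(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme();
        boolean overHttp = scheme != null
                && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
        // Over HTTP, a ".php" URL almost always serves the script's output,
        // not the PHP source itself.
        return overHttp && SERVER_SIDE_EXTENSIONS.contains(extensionOf(uri));
    }

    public static void main(String[] args) {
        System.out.println(shouldIgnoreNameHint("http://example.com/index.php")); // true
        System.out.println(shouldIgnoreNameHint("http://example.com/logo.png"));  // false
        System.out.println(shouldIgnoreNameHint("file:///srv/www/index.php"));    // false
    }
}
```

In this sketch the hint is only dropped for `http` and `https` URLs, so a local `index.php` file would still be treated as PHP source. Whether that is the right boundary is of course open to discussion.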
