Y. Sorry. At beach last week. Took care of quick issues yesterday, will try to return to your PRs today. Thank you!
On Thu, Jul 26, 2018 at 5:38 AM gbouchar <gbouc...@protonmail.com.invalid> wrote:
> Greetings everyone!
>
> I have two pull requests related to the use of tika for web content that
> have been waiting for quite some time now.
>
> - [Improving html charset detection](https://github.com/apache/tika/pull/242):
>   None of the current charset detectors in tika respect the web standards,
>   and in my tests, I found that around 15% of web pages were misdetected
>   using the default charset detector. This pull request implements a new
>   charset detector for web pages, with better accuracy.
> - [fixing mime-type detection over http](https://github.com/apache/tika/pull/236):
>   Currently, tika has no knowledge of server-side interpreted languages such
>   as PHP. Thus, given a URL like "http://example.com/index.php", it tends to
>   guess its mime type will be "text/x-php", whereas this is in fact very
>   unlikely. This PR gives tika the knowledge of which extensions are linked
>   to server-side interpreted languages.
>
> If someone could have a look at these pull requests, and maybe include
> them in the next release, that would help us a lot! I am of course still
> open to discussion and ready to update the code if changes need to be
> made.
>
> Cheers,
> G. Bouchar