Y. Sorry. At beach last week. Took care of quick issues yesterday, will try
to return to your PRs today. Thank you!

On Thu, Jul 26, 2018 at 5:38 AM gbouchar <gbouc...@protonmail.com.invalid>
wrote:

> Greetings everyone!
>
> I have two pull requests related to the use of tika for web content that
> have been waiting for quite some time now.
>
> - [Improving html charset detection](
> https://github.com/apache/tika/pull/242) : None of the current charset
> detectors in tika respect the web standards, and in my tests, I found that
> around 15% of web pages were misdetected using the default charset
> detector. This pull request implements a new charset detector for web
> pages, with better accuracy.
> - [fixing mime-type detection over http](
> https://github.com/apache/tika/pull/236) : Currently, tika has no
> knowledge of server-side interpreted languages such as PHP. Thus, given a
> URL like "http://example.com/index.php", it tends to guess its mime type
> will be "text/x-php", whereas this is in fact very unlikely. This PR gives
> tika the knowledge of which extensions are linked to server-side
> interpreted languages.
>
> If someone could have a look at these pull requests, and maybe include
> them in the next release, that would help us a lot! I am of course still
> open to discussion and ready to update the code if changes need to be
> made.
>
> Cheers,
> G. Bouchar
