improving Tika for web contents

gbouchar Thu, 26 Jul 2018 02:38:34 -0700

Greetings everyone!

I have two pull requests related to the use of Tika for web content that have 
been waiting for quite some time now.


- [Improving HTML charset detection](https://github.com/apache/tika/pull/242): 
None of the current charset detectors in Tika respect the web standards, and in 
my tests, I found that around 15% of web pages were misdetected by the 
default charset detector. This pull request implements a new charset detector 
for web pages, with better accuracy.
- [Fixing MIME type detection over HTTP](https://github.com/apache/tika/pull/236): 
Currently, Tika has no knowledge of server-side interpreted languages such as 
PHP. Thus, given a URL like "http://example.com/index.php", it tends to guess 
that the MIME type is "text/x-php", which is in fact very unlikely, since the 
server returns the script's output rather than its source. This PR gives Tika 
the knowledge of which extensions are linked to server-side interpreted languages.
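
To make the two ideas above a bit more concrete, here is a minimal Python 
sketch. It is only an illustration and is not the code from either pull 
request, nor Tika's API: the function names, the extension set, and the 
regular expressions are all assumptions of mine. The charset part loosely 
follows the order used by the HTML standard's encoding sniffing (byte order 
mark, then the HTTP Content-Type header, then a `<meta>` declaration in the 
first 1024 bytes), heavily simplified; a real detector would also map 
labels such as "iso-8859-1" to their standard encodings.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical, non-exhaustive set of extensions that usually denote
# server-side scripts; the list in the actual PR may differ.
SERVER_SIDE_EXTENSIONS = {".php", ".asp", ".aspx", ".jsp", ".cgi"}


def usable_extension_hint(url):
    """Return the URL's file extension if it is a trustworthy type hint.

    Returns None when there is no extension, or when the extension belongs
    to a server-side language (the response is the script's output, so the
    extension says nothing about the content type).
    """
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    if not suffix or suffix in SERVER_SIDE_EXTENSIONS:
        return None
    return suffix


# Byte order marks checked first, as they take precedence over declarations.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

# Simplified patterns (assumptions): they cover <meta charset=...> and
# <meta http-equiv=... content="...; charset=...">, but not every edge case.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)
HEADER_CHARSET = re.compile(
    r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_html_charset(body, content_type=None):
    """Return a lowercase charset label for an HTML page, or None if unknown."""
    # 1. A byte order mark wins over everything else.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # 2. Then the charset declared in the HTTP Content-Type header.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 3. Then a <meta> declaration within the first 1024 bytes.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Nothing declared: the caller would fall back to statistical detection.
    return None
```

With this sketch, `usable_extension_hint("http://example.com/index.php")` 
returns `None`, so a detector would have to rely on the HTTP headers or the 
content itself instead of guessing "text/x-php" from the URL.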

If someone could have a look at these pull requests, and maybe include them in 
the next release, that would help us a lot! I am of course still open to 
discussion and ready to update the code if changes need to be made.

Cheers,
G. Bouchar
