Hello Henri,

I was afraid this might be the case. So the library really is deprecated.

The project I'm working on involves multilingual environments, users, and
files, so yes, having a good encoding detector is important. Thanks for the
alternative recommendations. I see that they are C/C++ libraries, but in
theory they can be wrapped in a managed C++/CLI assembly and consumed by
a C# project. I haven't yet seen any existing C# ports that also handle
charset detection.
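
As a rough illustration of the "avoid heuristic detection" route suggested
in the quoted reply below, here is a minimal sketch in Python (chosen only
for brevity; the project itself is C#). It checks for a byte-order mark,
then tries strict UTF-8, and only then falls back to one encoding chosen
by the caller, so nothing is ever decoded as a rarely used encoding by
mistake. The function name decode_without_guessing and the windows-1252
default are assumptions made for this example, not part of chardet, ICU,
or compact_enc_det.

```python
import codecs
from typing import Tuple

# Byte-order marks, checked longest-first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_without_guessing(data: bytes,
                            fallback: str = "windows-1252") -> Tuple[str, str]:
    """Hypothetical helper: return (text, encoding_used) without heuristics.

    Order: explicit BOM, then strict UTF-8, then the single fallback
    encoding supplied by the caller (an assumption of this sketch).
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM themselves while decoding.
            return data.decode(name), name
    try:
        # Strict UTF-8 rarely accepts text written in a legacy encoding.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Bytes the fallback cannot map become U+FFFD rather than raising.
        return data.decode(fallback, errors="replace"), fallback
```

A caller that knows the user's locale could pass a different fallback, for
example decode_without_guessing(raw, fallback="windows-1250") for Central
European files. A real detector would only be needed where no such
per-user or per-file default is available.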

On Mon, May 22, 2017 at 5:49 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:

> On Mon, May 22, 2017 at 12:13 PM, Gabriel Sandor
> <gabi.t.san...@gmail.com> wrote:
> > I recently came across the Mozilla Charset Detectors tool, at
> > https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on
> > a C# project where I could use a port of this library (e.g.
> > https://github.com/errepi/ude) for advanced charset detection.
>
> It's somewhat unfortunate that chardet got ported over to languages
> like Python and C# with its shortcomings. The main shortcoming is that
> despite the name saying "universal", the detector was rather arbitrary
> in what it detected and what it didn't. Why Hebrew and Thai but not
> Arabic or Vietnamese? Why have a Hungarian-specific frequency model
> (that didn't actually work) but no models for e.g. Polish and Czech
> from the same legacy encoding family?
>
> The remaining detector bits in Firefox are for Japanese, Russian and
> Ukrainian only, and I strongly suspect that the Russian and Ukrainian
> detectors are doing more harm than good.
>
> > I'm not sure however if this tool is deprecated or not, and still
> > recommended by Mozilla for use in modern applications. The tool page is
> > archived and most of the links are dead, while the code seems to be at
> > least 7-8 years old. Could you please tell me what's the status of this
> > tool and whether I should use it in my project or look for something
> > else?
>
> I recommend not using it. (I removed most of it from Firefox.)
>
> I recommend avoiding heuristic detection unless your project
> absolutely can't do without. If you *really* need a detector, ICU and
> https://github.com/google/compact_enc_det/ might be worth looking at,
> though this shouldn't be read as an endorsement of either.
>
> With both ICU and https://github.com/google/compact_enc_det/ , watch
> out for the detector's possible guess space containing very rarely
> used encodings that you really don't want content detected as by
> mistake.
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
>
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
