Hello Henri,

I was afraid this might be the case, so the library really is deprecated.
The project I'm working on involves a multilingual environment, users, and files, so yes, having a good encoding detector is important. Thanks for the alternative recommendations. I see that they are C/C++ libraries, but in theory they can be wrapped in a managed C++/CLI assembly and consumed by a C# project. I haven't yet seen any existing C# ports that also handle charset detection.

On Mon, May 22, 2017 at 5:49 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:

> On Mon, May 22, 2017 at 12:13 PM, Gabriel Sandor
> <gabi.t.san...@gmail.com> wrote:
> > I recently came across the Mozilla Charset Detectors tool, at
> > https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on
> > a C# project where I could use a port of this library (e.g.
> > https://github.com/errepi/ude) for advanced charset detection.
>
> It's somewhat unfortunate that chardet got ported over to languages
> like Python and C# with its shortcomings. The main shortcoming is that
> despite the name saying "universal", the detector was rather arbitrary
> in what it detected and what it didn't. Why Hebrew and Thai but not
> Arabic or Vietnamese? Why have a Hungarian-specific frequency model
> (that didn't actually work) but no models for e.g. Polish and Czech
> from the same legacy encoding family?
>
> The remaining detector bits in Firefox are for Japanese, Russian and
> Ukrainian only, and I strongly suspect that the Russian and Ukrainian
> detectors are doing more harm than good.
>
> > I'm not sure however if this tool is deprecated or not, and still
> > recommended by Mozilla for use in modern applications. The tool page is
> > archived and most of the links are dead, while the code seems to be at
> > least 7-8 years old. Could you please tell me what's the status of this
> > tool and whether I should use it in my project or look for something else?
>
> I recommend not using it. (I removed most of it from Firefox.)
>
> I recommend avoiding heuristic detection unless your project
> absolutely can't do without. If you *really* need a detector, ICU and
> https://github.com/google/compact_enc_det/ might be worth looking at,
> though this shouldn't be read as an endorsement of either.
>
> With both ICU and https://github.com/google/compact_enc_det/ , watch
> out for the detector's possible guess space containing very rarely
> used encodings that you really don't want content detected as by
> mistake.
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
> _______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform