> > From what I know it is terribly hard to detect encodings, i.e.
> > differentiate between an iso-* encoding and a utf8 encoding. Any
> > document with any iso-* encoding is also a valid utf8-encoded
> > document.
>
> I have found the program chardet at http://chardet.feedparser.org/,
> which is based on statistical methods for detecting the encoding of
> files and is an adaptation of the method used in Netscape browsers,
> written in Python. This would be very useful for beagle, so I wonder
> whether beagle will implement this algorithm (for a description of it,
> check http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html)
> or should I propose this to the mono guys? I can start working on it,
> though you shouldn't expect much, as I'm not a CS guy.

There is ongoing work on detecting the "language" of a piece of text:
http://bugzilla.gnome.org/show_bug.cgi?id=354742 Detecting the charset
sounds similar but is not quite the same. If language detection is
possible, there is no reason we cannot have charset detection :-)
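
Just to illustrate the problem space (this is only a sketch, not anything
from beagle or chardet): one cheap first-pass heuristic is a strict UTF-8
decode. The function name `looks_like_utf8` below is made up for this
example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the byte string is valid UTF-8.

    A strict decode is a cheap first pass: bytes that fail to decode
    as UTF-8 must be some legacy single-byte encoding (or binary).
    The converse does not hold, which is why statistical detectors
    like chardet exist.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# "café" encoded two ways: the UTF-8 bytes validate, the Latin-1
# bytes (b"caf\xe9") do not, since 0xE9 starts an incomplete sequence.
assert looks_like_utf8("café".encode("utf-8"))
assert not looks_like_utf8("café".encode("latin-1"))
```

Of course this only tells you "UTF-8 or not"; picking *which* legacy
charset a non-UTF-8 file uses is the hard, statistical part.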

Several filetypes specify the charset themselves, like html, pdf and
other binary formats. The charset detector would be useful for the
others, e.g. text files and latex files, which do not specify the
encoding. Of course, all of this slows down the indexing process quite
a lot, so we need options to turn it on or off, but those can come
later.
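
A filter for such undeclared-encoding files might follow a fallback
chain along these lines (a sketch only; `decode_with_fallback` and the
`declared` parameter are hypothetical names, and a real implementation
would call a detector like chardet instead of blindly falling back):

```python
def decode_with_fallback(data: bytes, declared=None) -> str:
    """Decode text using the declared charset if the format provides
    one (html/pdf), otherwise try UTF-8, and finally fall back to
    ISO-8859-1, which never fails since every byte value is valid
    in it. A statistical detector would slot in before that last step.
    """
    candidates = ([declared] if declared else []) + ["utf-8"]
    for enc in candidates:
        try:
            return data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # bad bytes or unknown codec name: try the next one
    return data.decode("iso-8859-1")

# UTF-8 input decodes directly; raw Latin-1 bytes hit the fallback.
assert decode_with_fallback("café".encode("utf-8")) == "café"
assert decode_with_fallback(b"caf\xe9") == "café"
```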

AFAIK no one is working on this yet. If you are interested, you can
start by porting one of the libraries to C#, finding an existing one,
or convincing someone to port it ;-). Keep us posted.

Thanks,
- dBera

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user