Tres Seaver wrote:
> M.-A. Lemburg wrote:
>
>> Shouldn't this encoding guessing be a separate function that you call
>> on either a file or a seekable stream ?
>
>> After all, detecting encodings is just as useful to have for non-file
>> streams.
>
> Other stream sources typically have out-of-band ways to signal the
> encoding:  only when reading from the filesystem do we pretty much
> *have* to guess, and in that case the BOM / signature is the best
> heuristic we have.  Also, some non-file streams are not seekable, and so
> can't be guessed via a pre-pass.

Sure, there are non-seekable file streams, but at least when using
StringIO-type streams you don't have that restriction.

An encoding detection function would be useful in other cases as well,
so instead of hiding the functionality away in the open() constructor,
I'm suggesting that we expose it via the codecs module.

Applications can then use it where necessary and also provide their
own defaults, using other heuristics.

>> You'd then avoid having to stuff everything into
>> a single function call and also open up the door for more complex
>> application specific guess work or defaults.
>
>> The whole process would then have two steps:
>
>> 1. guess encoding
>
>>    import codecs
>>    encoding = codecs.guess_file_encoding(filename)
>
> Filename is not enough information:  or do you mean that API to actually
> open the stream?

Yes. The API would open the file, guess the encoding and then close it
again. If you don't want that to happen, you could use the second API
I mentioned below on the already open file.

Note that this function could detect many more encodings with
reasonably high probability than just BOM-prefixed ones, e.g. we could
also add support for detecting encoding declarations such as the ones
we use in Python source files.

>> 2. open the file with the found encoding
>
>>    f = open(filename, encoding=encoding)
>
>> For seekable streams f, you'd have:
>
>> 1. guess encoding
>
>>    import codecs
>>    encoding = codecs.guess_stream_encoding(f)

I forgot to mention: This API needs to position the file pointer
at the start of the first data byte.

>> 2. wrap the stream with a reader for the found encoding
>
>>    reader_class = codecs.getreader(encoding)
>>    g = reader_class(f)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 08 2010)
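
For illustration only, the stream variant of the two-step process could be sketched roughly as below. Note that guess_stream_encoding() is merely a proposal in this thread and is not part of the codecs module, so the function here is a hypothetical standalone helper. It assumes a seekable binary stream, checks only BOM signatures (not encoding declarations), and leaves the file pointer at the first data byte. The `default` parameter and the `_BOMS` table are likewise assumptions made for this sketch.

```python
import codecs
import io

# BOM signatures mapped to codecs that do NOT expect a BOM themselves,
# because the helper below already skips past it.
# UTF-32 comes first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
# (A UTF-16-LE stream whose first character is U+0000 is therefore
# ambiguous and would be reported as UTF-32-LE.)
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_stream_encoding(stream, default=None):
    """Hypothetical helper: guess an encoding from a leading BOM.

    `stream` must be a seekable binary stream.  On return, the stream
    is positioned just past the BOM, or back at its original position
    if no BOM was recognised (in which case `default` is returned).
    """
    start = stream.tell()
    head = stream.read(4)  # the longest BOM is 4 bytes
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            stream.seek(start + len(bom))
            return encoding
    stream.seek(start)
    return default


# Step 1: guess the encoding (application-supplied default as fallback).
f = io.BytesIO(codecs.BOM_UTF8 + "data".encode("utf-8"))
encoding = guess_stream_encoding(f, default="latin-1")

# Step 2: wrap the stream with a reader for the found encoding.
reader_class = codecs.getreader(encoding)
g = reader_class(f)
text = g.read()
```

An application with better knowledge of its data could pass its own default or run further heuristics whenever the helper returns the fallback value.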