The code is in XMLParser: see XMLEncodingDetector. I can port it, if you think the algorithm is appropriate.
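
For anyone without the spec open, here is a rough sketch of the byte-pattern table from the YAML 1.2 spec (section 5.2), written in Python for illustration only. It is not the XMLEncodingDetector code; the name `detect_encoding` and the returned labels are assumptions made for this example.

```python
def detect_encoding(data: bytes) -> str:
    """Guess a Unicode encoding from the first bytes of a stream.

    Hypothetical helper: rows are tried in the order the YAML 1.2 table
    lists them. UTF-32 goes first because FF FE 00 00 also begins with
    the UTF-16LE byte order mark.
    """
    head = data[:4]
    if head.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"  # explicit BOM
    if len(head) == 4 and head[:3] == b"\x00\x00\x00":
        return "utf-32-be"  # no BOM, ASCII first character
    if head.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"  # explicit BOM
    if len(head) == 4 and head[1:] == b"\x00\x00\x00":
        return "utf-32-le"  # no BOM, ASCII first character
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"  # explicit BOM
    if len(head) >= 2 and head[0] == 0:
        return "utf-16-be"  # no BOM, ASCII first character
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"  # explicit BOM
    if len(head) >= 2 and head[1] == 0:
        return "utf-16-le"  # no BOM, ASCII first character
    # A UTF-8 BOM (EF BB BF) and the no-match default both mean UTF-8.
    return "utf-8"
```

The sketch only reports a label; it does not strip a byte order mark or validate the rest of the stream. The rows without a BOM rely on the first character being ASCII, which holds for YAML and XML documents but not for arbitrary text files.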
The YAML algorithm is actually a less restrictive version of this XML one:
https://www.w3.org/TR/REC-xml/#sec-guessing

The XML one is "Non-Normative" (i.e. optional), so I chose to implement the more general YAML algorithm instead.

> Sent: Friday, March 16, 2018 at 6:44 AM
> From: "Sven Van Caekenberghe" <[email protected]>
> To: "Pharo Development List" <[email protected]>
> Subject: Re: [Pharo-dev] Executive Summary of the recent FileStream Changes
>
>
>
> > On 16 Mar 2018, at 07:05, monty <[email protected]> wrote:
> >
> >> Sent: Thursday, March 15, 2018 at 4:01 PM
> >> From: "Sven Van Caekenberghe" <[email protected]>
> >> To: "Pharo Development List" <[email protected]>
> >> Subject: [Pharo-dev] Executive Summary of the recent FileStream Changes
> >>
> >> Executive Summary of the recent FileStream Changes
> >>
> >> In Pharo 7 Guille Polito recently committed a heroic set of changes that
> >> we were planning to do for a long time but were afraid to take on.
> >>
> >> The idea is to replace a couple of fat, overly complex, multi-functional,
> >> do-all classes with a set of simpler single purpose classes that can be
> >> combined as needed.
> >>
> >> The classes that we want to get rid of can be found in the package
> >> DeprecatedFileSystem, in particular FileStream, StandardFileStream,
> >> MultiByteFileStream, MultiByteBinaryOrTextStream and RWBinaryOrTextStream.
> >
> > StandardFileStream, at least, should remain for backwards compatibility and
> > cross-platform compatibility with Squeak. It's a no-frills, non-decoding,
> > non-LE normalizing stream that is heavily depended on.
>
> Hmm, maybe.
>
> The standard (no pun intended) interface to the file system in Pharo has been
> FileSystem (FileReference) for quite a while. Many packages dealing with
> either different Pharo versions or different Smalltalk implementations have
> constructed their own portability facade (heck, I even did it in
> ZnFileSystemUtils myself).
>
> Note however that some aspects (API, behaviour) about the streams themselves
> changed as well (no longer being bivalent, separating reading/writing,
> smaller/simpler API, sometimes no positioning).
>
> >> The replacements are can be found in packages Files and
> >> Zinc-Character-Encoding-Core.
> >>
> >> Encoding and decoding characters to and from bytes is done using classes
> >> that you wrap around a more primitive binary stream. The same goes for
> >> buffering or translating line endings.
> >>
> >> For example,
> >>
> >> '/Users/sven/Desktop/foo.txt' asFileReference binaryReadStream.
> >>
> >> gives you a ZnBufferedWriteStream wrapping a BinaryWriteStream.
> >>
> >> While,
> >>
> >> '/Users/sven/Desktop/foo.txt' asFileReference readStream.
> >
> > What do you think about this algorithm for encoding detection:
> > http://www.yaml.org/spec/1.2/spec.html#id2771184
> >
> > I have an implementation (with tests), if you're interested. (I was waiting
> > to propose it until the FileSystem API switched over to using Zn streams
> > and encoders. The TextConverter API doesn't support UTF-32.)
>
> I did a primitive one in ZnCharacterEncoding class>>#detectEncoding: but I am
> not happy with it. I will read your reference and I am certainly interested
> in seeing your code !
>
> >> gives a ZnCharacterReadStream wrapping a ZnBufferedWriteStream wrapping a
> >> BinaryWriteStream.
> >>
> >> To translate line endings, we would wrap a ZnCharacterWriteStream using a
> >> ZnCrPortableWriteStream.
> >>
> >> There are a couple of more specialised streams to cover special cases
> >> (like read and writing at the same time).
> >>
> >> SocketStream remains another fat, overly complex, multi-functional, do-all
> >> class, for which usable replacements exist in the form of ZdcSocketStream
> >> and ZdcSecureSocketStream, which are simpler, cleaner and binary only.
> >>
> >> Of course, switching is more than replacing one class with a 100%
> >> compatible alternative, that would give us the same complex result. The
> >> challenge is to use a simpler API as well, to rethink how the streams are
> >> used. You know, KISS.
> >>
> >> Of course, we are far from done and need more testing, debugging and help
> >> from as many people as possible.
> >>
> >> Sven
> >>
> >>
> >> --
> >> Sven Van Caekenberghe
> >> Proudly supporting Pharo
> >> http://pharo.org
> >> http://association.pharo.org
> >> http://consortium.pharo.org
> >
>
