Re: [Haskell-cafe] lazy boxed array and builder?
Hi Bas, I'm not sure the unpacking is absolutely necessary. It might be worth to give it a try with not-unpacked strict chunks. In some of my ByteString builder experiments, I even got better performance by not unpacking the ByteStrings in some of the intermediate data structures. My gut feeling says that compensating the extra indirection is not too hard provided you ensure that your chunks are large enough. The corner case of having lots of small lazy bytestrings is likely to be slower. Experiments will tell. best regards, Simon 2012/7/12 Bas van Dijk v.dijk@gmail.com: On 12 July 2012 15:35, Yves Parès yves.pa...@gmail.com wrote: I remember this discussion, lazy vectors would also enable an implementation of bytestring and (maybe) text only with unboxed vectors, unifying it all: type ByteString = Vector Word8 Yes, I would like to add a lazy storable vector to vector-bytestring[1] to make the API 100% consistent with bytestring. Ideally we would have a type like: data Lazy vector a = Empty | Chuck {-# UNPACK #-} !(vector a) (Lazy vector a) Unfortunately GHC can't unpack polymorphic fields. The next best thing is to use a type family which for each type of vector would return its lazy version (where the vector is unpacked in the cons cell). Then we would need a class for common operations on those lazy vectors. Regards, Bas [1] http://hackage.haskell.org/package/vector-bytestring https://github.com/basvandijk/vector-bytestring ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Hackage 2 maintainership
Hi Duncan, I just wanted to thank you and all the other guys pushing Hackage 2 towards a public release. I just tested the http://hackage.factisresearch.com/ instance and it's blazingly fast. Cool stuff! The reverse dependencies are also very useful. I know that sending patches instead of thanks would help more. I'll try that in a (hopefully) not too distant future :-) best regards and thanks again, Simon 2012/2/14 Duncan Coutts duncan.cou...@googlemail.com: Hi Ben, On 13 February 2012 23:44, Ben Gamari bgamari.f...@gmail.com wrote: Hey all, Those of you who follow the Haskell subreddit no doubt saw today's post regarding the status of Hackage 2. As has been said many times in the past, the primary blocker at this point to the adoption of Hackage 2 appears to be the lack of an administrator. Yes, much of it is lack of an individual to keep momentum up and keep everyone else motivated. While I'm keen that hackage moves forward, my volunteer time is spread too thin to be that person keeping everything organised. That said, where I spend my volunteer time is to a large part directed by what other people are doing, it's much more fun and motivating if there's other people working with you. Speaking of which, I spent much of this evening fixing things, more details below. It seems to me this is a poor reason for this effort to be held up. Having taken a bit of time to consider, I would be willing to put in some effort to get things moving and would be willing to maintain the haskell.org Hackage 2.0 instance going forward if necessary. That would be great. So in the short term I'm very happy to get the help and in the longer term I'm happy to hand over to anyone sensible who puts in the effort. That person could be you, someone else or a team of several people. More immediately, my general policy with commit access is to give it to anyone who's sent a few good patches. Currently there are 7 people with write access to the darcs repo on code.h.o. It is of course also fine for people to maintain their own public branches (which they can do using git or darcs, whichever). I currently have a running installation on my personal machine and things seem to be working as they should. On the whole, installation was quite trivial, so it seems likely that the project is indeed at a point where it can take real use (although a logout option in the web interface would make testing a bit easier). Yes, we're at the stage where we can run a public testing instance. You'll see there's a bit more to implement and test for a switchover. That being said, it would in my opinion be silly to proceed without fixing the Hackage trac. It was taken down earlier this year due to spamming[1] and it seems the recovery project has been orphaned. I would be willing to help with this effort, but it seems like the someone more familiar with the haskel.org infrastructure might be better equipped to handle the situation. I spent a couple hours on this this evening and I've finally fixed it (I hope). I still need to purge a bit of wiki/ticket spam (help apreciated there). Sadly I've had to blow away the previos login accounts, but I've semi-restored by copying the ghc trac accounts. So if you happen to have an account on the ghc trac, then your login should work for the hackage trac. Otherwise you'll need to re-register as if it was a new account. It seems that this process will go something like this, 1) Bring Hackage trac back from the dead Check. 2) Bring up a Hackage 2 instance along-side the existing hackage.haskell.org Yes, now that the trac is back, you can see what notes we have on the switchover process at: http://hackage.haskell.org/trac/hackage/wiki/HackageDB/2.0 Note also that the nice people at factisresearch.com have given us a VM with enough memory (8GB) for the purpose of running a public test with the full package set (in principle it should not need so much memory, but we currently keep unnecessary package metadata in memory). So thanks to you and others this evening motivating me, I've also taken Max's latest patches to my tar package (which coincidentally I released yesterday) and the corresponding hackage-server patch and set it running at: http://hackage.factisresearch.com/ This is running the latest upstream darcs version. I have also fired off a one-shot mirroring operation. This will mirror all the existing packages from hackage. It'll probably take half a day or so to complete since there's something like 30-40k tarballs to be copied over. I'll check the logs tomorrow hopefully and after that kick off a live/continuous mirror so it'll get new updates from the main hackage within 20-30 min or so. Last time Max and I tried this we were able to mirror almost all packages. Most of the unmirrorable ones at the time were due to packages with quirks in their tar format, which is what his tar patches were aimed at. So I'm
Re: [Haskell-cafe] List x ByteString x Lazy Bytestring
Hi John, I've used Haskell and GHC to solve particular real life application. 4 tools were developed and their function is almost the same - they modify textual input according to patterns found in the text. Thus, it Hmm, modification can be a problem for ByteStrings, since it entails copying. That could be worse for strict BytStrings than lazy, if in the lazy ByteString you can reuse many chunks. I understand now, that is probably the point. Two main possibilities: 1. your algorithm isn't suited for ByteStrings 2. you're doing it wrong The above indicates 1., but without a more detailed description and/or code, it's impossible to tell. Yes, it seems that the (1) is the point, because I split and re-build the bytestream many times during processing. For splitting it might be interesting to have a look at 'attoparsec' (http://hackage.haskell.org/package/attoparsec), which are parser combinators specialized to bytestring's. If the splitting still leaves large enough chunks of the original input intact (large enough being around 128 bytes), then you might be able to achieve a benefit. To use strict and lazy bytestrings efficiently, it helps a lot to know their internal representation (a slice of a char[] array and lists of slices of char[] arrays) and to think about what the different operations cost on this datastructure. (The source code of the library is actually not that hard to understand.) An obviously very expensive operation is concatenation (guaranteed copying for strict bytestrings at least a list traversal for lazy bytestrings). The blaze-builder library provides a type that supports O(1) concatenation of sequences of bytes and an efficient conversion to a lazy bytestring that has a large average chunk size. Large chunks are important to amortize the work spent on the boundary with the speedup gained due to the compact and cache efficient representation. If you keep using lists, but need a lot of concatenation using difference lists (http://hackage.haskell.org/package/dlist) also allows a O(1) concatenation. Note that the cost of having this O(1) concatenation is that the resulting list cannot be inspected, as long as it is in a form that supports O(1) concatenation. The same holds for the lazy bytestring builder provided by blaze-builder. Note that I'd be very curious, if you could achieve any further performance improvement using blaze-builder...which is no surprise, as I'm its author :-) Note also that the API has been cleaned up and will be published with the next release of the bytestring library. So just ignore the whole 'write' stuff. I should have separated it. good luck and best regards, Simon ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Attoparsec concatenating combinator
2011/6/6 Bryan O'Sullivan b...@serpentine.com: On Sun, Jun 5, 2011 at 11:00 AM, Yitzchak Gale g...@sefer.org wrote: If behind the scenes the concat is copying directly from slices of the original input, then no, in principle we're not saving much then. I thought there were *two* copies going on. If you're using the specialised functions like attoparsec's takeWhile, then all they do is return a view into the underlying array. No copying occurs until the concat itself. Now that I think of it: in principle, you could write a specialised concat that would check the pointer/offset/length combinations of its arguments and, if they all abutted perfectly, would just return a new view into that same array, sans copying. (You'd have to hide it behind unsafePerformIO, of course.) Why would you need 'unsafePerformIO'. You can scrutinise the 'PS' constructors of the slice without dropping down to IO. The required 'Eq' instance on 'ForeignPtr' is also present. Using a Builder for concatentation makes sense, if you want to exploit that copying a slice of the input array is cheaper right after it has been inspected (its fully cached) than later (as it is done when collecting slices in a list). However, you can only have one Builder at a time and some low-level meddling is probably required to interleave the feeding of the Parser with input arrays with the feeding of the Builder with free buffers. Nevertheless, for something like parsing Chunked HTTP content it would make a lot of sense. I'm inclined look into that once I finished porting the blaze Builder to the 'bytestring' library. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Attoparsec concatenating combinator
2011/6/7 Bryan O'Sullivan b...@serpentine.com: On Tue, Jun 7, 2011 at 1:40 AM, Simon Meier iridc...@gmail.com wrote: Why would you need 'unsafePerformIO'. You can scrutinise the 'PS' constructors of the slice without dropping down to IO. True. Oops :-) Using a Builder for concatentation makes sense, if you want to exploit that copying a slice of the input array is cheaper right after it has been inspected (its fully cached) than later (as it is done when collecting slices in a list). When I've measured this in the past, I've found that it's often faster to accumulate a list and then run concat at the end than to use blaze-builder directly. That was certainly the case wit GHC 6.12; I haven't remeasured with 7.0. That's why you'll see that some places in the aeson JSON library use blaze-builder, while others manipulate bytestrings directly. When creating a Builder that you run afterwards, then you essentially create a list of bytestring concatenations as a closure. It makes sense that you don't win with such an approach, as the concatenation still only happens after then end of this list is reached. What you'd need is to nest attoparsec's Parser monad with the 'Put' monad provided in the blaze-builder internals [1]. However, I'm not yet sure how to achieve this in a modular fashion. Perhaps, using iteratee's the right way might provide a good answer, but that's just guesswork. The performance characteristics of blaze-builder are such that for short output sequences working with bytestrings directly is sometimes favorable, as the setup overhead before the Builder can be executed needs to be amortized. However, Builders work with a single pass through the input data, while many bytestring operations require two passes. For example, 'Data.ByteString.pack' first determines the length of the input and only afterwards writes the data to memory. Using blaze-builder's 'fromWord8s' requires only a single traversal and is faster for lists of length 64 on my 32-bit Core 2 Duo machine. I expect a similar effect for concatenating lists of strict bytestrings. [1] http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Internal.html#g:3 ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Efficient object identity (aka symbols as data)
2011/5/26 Jacek Generowicz jacek.generow...@cern.ch: On 2011 May 26, at 11:16, Christopher Done wrote: On 26 May 2011 10:45, Jacek Generowicz jacek.generow...@cern.ch wrote: What is the Haskell approach to efficient comparison and lookup of objects by their identity? Often you just provide your own and implement Eq. I should be able to run the program on data that becomes available at run time. Typically you define an id generator and zip anything coming from the input stream up with that generator. Makes sense. Whatever algorithm I choose to use for the optimization, will have to do lots of comparisons of Groups and Persons where their *identity* is all that matters: you don't need to look inside the objects. To achieve this abstraction the usual way is just implementing Eq: instance Eq Person where Person{personId=id1} == Person{personId=id2} = id1 == id2 Any comments on the relative efficiency of the above as compared to A == B in the context of data Foo = A | B | C | D | ... lots more ... ? (I imagine that a Sufficiently Smart Compiler could reduce (==) :: Person Person to just integer comparison.) I'm pretty sure GHC will do that for you. An approach similar to the one by Chris Done is to use a bi-directional map between Persons and Ints along the lines of data IdMap a = IdMap (IntMap a) (Map a Int) You can then associate a unique Int with each of your Persons and use this during your comparison. For associating Ints to the Persons, a simple fold or a State monad computation suffice. For the lookups on the further properties of Persons, an additional argument or the Reader monad will do. If you use a State monad and a single operation that associates an Int to a Person, then you additionally get the guarantee (inside a monadic computation) that no two Persons will be associated with the same Int. Efficiency-wise, you'll have O(log(n)) association, O(min(n,W)) access time, and O(1) comparison time with a very low constant factor. See the IntMap documentation for the O(min(n,W)) explanation. Additionally, the code is pure with all the nice properties that come with it. By the way this problem is very similar to the one of observable sharing. See this thread: http://www.haskell.org/pipermail/haskell-cafe/2008-February/039639.html best regards, Simon ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Policy for taking over a package on Hackage
2011/5/25 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com: On 25 May 2011 22:17, Stephen Tetley stephen.tet...@gmail.com wrote: Hi Ivan Forks are good, no? The Parsec experience has suggested to me at least, that new author's capping another author's work by bumping up to a major version, causes a significant difficulties even when the original author has gone. As for wl-pprint, it was a very tidy library in its original implementation - it's a pity it now has name clashes with Applicative. My feeling is that a new library in a new namespace with some attention to new combinator names would be better. Such as? I'm _hopeless_ at making up names... ;-) Having a new package would require a new name and new module namespace, let alone thinking up new names for combinators... Also, by clashes with Applicative, are you referring to empty and $ ? I'm not sure if a better name than empty can be found; as for $, maybe using pretty's notation of $$ and $+$ rather than $ and $$ ? What about 'emptyDoc'? Moreover, if you are changing the names of combinators, then moving them away from Applicative and Arrow would be a good idea; i.e., don't use +, as it already used by ArrowPlus. Moreover, if you can make a Monoid instance such that `mappend` equals , you would also make the library compatible to a future introduction of () = mappend. best regards, Simon ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform
2011/5/19 Antoine Latter aslat...@gmail.com: On Thu, May 19, 2011 at 3:06 PM, Simon Meier iridc...@gmail.com wrote: The core problem that drove me towards this solution is the abundance of different IntX and WordX types. Each of them requiring a separate Write for big-endian, little-endian, host-endian, lower-case-hex, and uper-case-hex encodings; i.e., currently, there are int8BE :: Write Int8 int16BE :: Write Int16 int32BE :: Write Int32 ... hexLowerInt8 :: Write Int8 ... and so on. As you can see (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html) this approach clutters the public API quite a bit. Hence, I'm thinking of using a separate type-class for each encoding; i.e., If Johan's work on Data.Binary and rewrite rules works out, then it would cut the exposed API in half, which helps. We could then use the module and package system to further keep the API clean, with builders which output a specific encoding could live in separate modules. This could also keep the names of the functions short, as well. That would require coming up with logical divisions for the functions you're creating, and I don't understand the big picture enough to help with that. class BigEndian a where bigEndian :: Write a This collapses the big-endian encodings of all 10 bounded-size (signed and unsigned) integer types under a single name with a well-defined semantics. Moreover, it's standard Haskell 98. For the hex-encodings, I'm thinking about providing type-classes class HexLower a where hexLower :: Write a class HexLowerNoLead a where hexLowerNoLead :: Write a ... for ASCII encoding and each of the standard Unicode encodings in a separate module. The user can then select the right ones using qualified imports. In most cases, he won't even need qualification, as mixing different character encodings is seldomly used. I think we may be at cross-purposes here, and might not even be discussing the same thing - I would imagine that any sort of 'Builder' type included in the bytestring package would only provide the core combinators for packing data into low-level binary formats, so discussions about text encoding issues, converting to hexidecimal and Html escaping are going above my head. This seems like what the 'text' package was written for - to separate out the construction of textual data from choosing its encoding. Are there use-cases where the 'text' package is too slow for this sort of approach? Take care, Antoine What do you think about such an interface? Is there another catch hidden, I'm not seeing? BTW, note that Writes are a pure compile time abstraction and are thought to be completely inlined. In typical, uses cases there's no efficiency overhead stemming from these typeclasses. best regards, Simon Yes, for example using the current 'text' package is sup-optimal for dyamically generating UTF-8 encoded HTML pages. The job is simple: the data which is originally held in standard Haskell types (e.g., String) needs to be HTML escaped and UTF-8 encoded and sprinkled with tags in between. For blaze-html using blaze-builder the cost for a tag is a memcpy of the corresponding tag and the cost for a single character is one call to the nested case statement determining if the char needs to be escaped (one memcpy of its escaped version) or what bytes need to be written for UTF-8 encoding the char. This solution works with a single output buffer. For a solution using the text library the cost of creating the underlying UTF-16 array is similar to the cost for blaze-builder. However, you now also need to UTF-8 encode the UTF-16 array. This costs you more than double, as now you also have to inspect every character of every tag. For ~50% of your data you suddenly have to spend a lot more effort! I agree that the text library is a good choice for representing Unicode data of an application. However, for high-performance applications it pays off to think of its output in binary form and exploit the offered shortcuts. That's where blaze-builder and the like come in. thanks for your input, Simon ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform
2011/5/20 Bas van Dijk v.dijk@gmail.com: On 19 May 2011 10:53, Johan Tibell johan.tib...@gmail.com wrote: Long term we'd like to switch bytestring over from ForeignPtr to ByteArray#, if possible. There are currently some technical obstacles to such a switch BTW I'm working with Roman Leshchinskiy to create the vector-bytestring package which provides: type ByteString = Data.Vector.Storable.Vector Word8 and exports the same API as the bytestring package (no support for lazy bytestrings yet) A storable vector still uses a ForeignPtr but maybe this will make the switch to unboxed Vectors (which use ByteArray#) easier. Expect some code to be up somewhere next week. Bas OK, that sounds interesting. I'm looking forward to your code. One trick, I'm using in the blaze-builder implementation is that the current buffer is denoted by a region (pf :: Ptr Word8, pe :: Ptr Word8) of the next free byte `pf` and the first byte after the buffer `pe`. This only works for pinned ByteArrays/ForeignPtrs. Do you know what the cost of such an array/foreignptr is? Moreover, after creation I could unpin the array. Do you perhaps know if thats possible in principle? best regards, Simon ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform
2011/5/20 Henning Thielemann schlepp...@henning-thielemann.de: Simon Meier schrieb: There are many providers of Writes. Each bounded-length-encoding of a standard Haskell value is likely to have a corresponding Write. For example, encoding an Int32 as a big-endian, little-endian, and host-endian byte-sequence is currently achieved with the following three functions. writeInt32BE :: Write Int32 writeInt32LE :: Write Int32 writeInt32HE :: Write Int32 I would like to avoid naming all these encodings individually. Maybe this one helps: http://hackage.haskell.org/packages/archive/storable-endian/0.2.4/doc/html/Data-Storable-Endian.html ? Thanks. I didn't know of that package. It for sure provides some valuable input. Moreover, it pointed me to Antoine Latter's byteorder package (http://hackage.haskell.org/package/byteorder). I might use it, although I would rather have the ByteOrder determined at compile time. Shouldn't GHC provide some means to determine this, as it's compiling for a fixed architecture anyway? What experience have you made Antoine when you implemented this package? best regards, Simon ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform
2011/5/20 Johan Tibell johan.tib...@gmail.com: Hi Simon, On Thu, May 19, 2011 at 10:46 PM, Simon Meier iridc...@gmail.com wrote: Write achieves this separation, but it has some costs which I'm not entirely comfortable with. First, it leads to lots of API duplication. For every type (e.g. Word) we want to be able serialize we have two morally identical functions writeWordhost :: Word - Write fromWordhost :: Word - Builder in the API, where the latter simply calls the former and does some additional wrapping. Yes, I agree with this duplication. I'll explain below what we gain from it. Note that I factored out the whole Write stuff into its own library (system-io-write) for the bytestring integration. Therefore, an end-user of bytestring will only see the Builder versions except he's doing more low-level stuff to gain some extra performance. There are (at least) two cases where I think the simple Builder API must perform well for it to be usable on its own: simple loops and sequential writes. To be specific, if the following two cases don't compile into near optimal code, there's a compiler bug we should fix. First, a simple loop: f :: [Word8] - Builder f [] = mempty f (x:xs) = singleton x `mappend` xs This code is already quite low level, there should be enough information here for the compiler to emit a simple loop with one buffer bounds check per iteration. Second, a bunch of sequential writes: g :: Word8 - Word8 - Word8 - Word8 - Builder g a b c d = singleton `mappend` (b `mappend` (c `mappend` d)) This ought to compile to a single bounds check followed by for memory writes. The user shouldn't have to get more low-level than this in these simple examples. Today this is currently only true for the second example, which we can solve using rewrite rules. The first example doesn't work due to the GHC compiler bug I mentioned. I agree with you that simple uses of the Builder API should be optimized well. I think we can also guide the user indirectly by making writing the efficient code even easier than writing the possibly dangerous one. I'm thinking about providing type-classes for standard encodings. For example, class Utf8 a where utf8 :: a - Builder instance Utf8 String where utf8 = fromWriteList writeCharUtf8 -- further instances: Char, Text This yields another way of navigating around difficult optimization territory. Simon, is the reason for this duplication this comment on top of Blaze.ByteString.Builder.Word? snip That's one of the reasons, but not the main one. The core reason is that Write's provide an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through (streams of) Haskell values. For simple traversals like fromWrite :: Write a - a - Builder fromWriteList :: Write a - [a] - Builder fromWriteUnfoldr :: Write b - (a - Maybe (b, a)) - a - Builder there might be the option that GHC is clever enough and can find the efficient loop. However, for more complicated functions like mapWriteByteString :: Write Word8 - S.ByteString - Builder That certainly isn't the case. I'm using quite a few tricks there [3] to enable a tight inner loop with few live variables. Right. So this argues for having an escape hatch, and I agree we should have one. Write at writeAtMost are both such escape hatches and I believe them to equal in expressiveness. This shouldn't come as a surprise as Write is writeAtMost with one argument reified into into a constructor field: writeAtMost :: Int - (Ptr Word8 - IO Int) - IO () data Write = {-# UNPACK #-} !Int (Ptr Word8 - IO (Ptr Word8)) (That the second argument of writeAtMost is an Int instead of a Ptr Word8 as in Write is an unimportant difference.) There, seems to be a historical artefact here. The new Write abstraction in system-io-write is different from the one used in blaze-builder. It's type is data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8)) This definition ensures that the bound on the number of bytes written is independent of the value being encoded. That's crucial for the implementation of `mapWriteByteString`. It also benefits the other Write combinators, as the bound can always be computed in a data-independent fashion. Inlining, is therefore really sufficient to arrive at a constant bound during compile time. I don't see how this Write type can be emulated using `writeAtMost`, do you? There are some operational differences. * The argument to Write can be inspected at runtime, while the argument to writeAtMost can only be inspected at compile time (by a rewrite rule). * Write might exist at runtime, if it's allocation site cannot be seen by its use site, which hard to guarantee in general (it requires serious staring at Core). This is not the case for writeAtMost, unless
Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform
Hi Antoine, thanks for your feedback. 2011/5/18 Antoine Latter aslat...@gmail.com: On Wed, May 18, 2011 at 12:32 PM, Simon Meier iridc...@gmail.com wrote: Hello Haskell-Cafe, There are many providers of Writes. Each bounded-length-encoding of a standard Haskell value is likely to have a corresponding Write. For example, encoding an Int32 as a big-endian, little-endian, and host-endian byte-sequence is currently achieved with the following three functions. writeInt32BE :: Write Int32 writeInt32LE :: Write Int32 writeInt32HE :: Write Int32 I would like to avoid naming all these encodings individually. Especially, as the situation becomes worse for more elaborate encodings like hexadecimal encodings. There, we encounter encodings like the utf8-encoding of the hexadecimal-encoding with lower-case letters of an Int32. writeInt32HexLowerUtf8 :: Write Int32 I really don't like that. Therefore, I'm thinking about the following solution based on type-classes. We introduce a single typeclass class Writable a where write :: Write a and use a bunch of newtypes to denote our encodings. newtype Ascii7 a = Ascii7 { unAscii7 :: a } newtype Utf8 a = Utf8 { unUtf8 :: a } newtype HexUpper a = HexUpper { unHexUpper :: a } newtype HexLower a = HexLower { unHexLower :: a } ... Assuming FlexibleInstnaces, we can write encodings like the above hex-encoding as instances instance Write (Utf8 (HexLower Int32)) where write = ... This composes rather nicely and allows the implementations to exploit special properties of the involved data. For example, if we also had a HTML escaping marker newtype Html a = Html { unHtml :: a } Then, the instance instance Write (Utf8 (HTML (HexLower Int32))) where write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i)) If I were authoring the above code, I don't see why that code is any easier to write or easier to read than: urf8HtmlHexLower i = utf8HexLower i And if I were using the encoding functions, I would much prefer to see: urf8HtmlHexLower magicNumber In my code, instead of: write $ Utf8 $ HTML $ HexLower magicNumber In addition, this would be difficult for me as a developer using the proposed library, because I would have no way to know which combinations of newtypes are valid from reading the haddocks. Maybe I'm missing something fundamental, but this approach seems more cumbersome to me as a library author (more boilerplate) and as the user of the library (less clarity in the docs and in the resultant code). Hmm, that's a valid point you raise here. Especially, the documentation issue bothers me. The core problem that drove me towards this solution is the abundance of different IntX and WordX types. Each of them requiring a separate Write for big-endian, little-endian, host-endian, lower-case-hex, and uper-case-hex encodings; i.e., currently, there are int8BE :: Write Int8 int16BE :: Write Int16 int32BE :: Write Int32 ... hexLowerInt8 :: Write Int8 ... and so on. As you can see (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html) this approach clutters the public API quite a bit. Hence, I'm thinking of using a separate type-class for each encoding; i.e., class BigEndian a where bigEndian :: Write a This collapses the big-endian encodings of all 10 bounded-size (signed and unsigned) integer types under a single name with a well-defined semantics. Moreover, it's standard Haskell 98. For the hex-encodings, I'm thinking about providing type-classes class HexLower a where hexLower :: Write a class HexLowerNoLead a where hexLowerNoLead :: Write a ... for ASCII encoding and each of the standard Unicode encodings in a separate module. The user can then select the right ones using qualified imports. In most cases, he won't even need qualification, as mixing different character encodings is seldomly used. What do you think about such an interface? Is there another catch hidden, I'm not seeing? BTW, note that Writes are a pure compile time abstraction and are thought to be completely inlined. In typical, uses cases there's no efficiency overhead stemming from these typeclasses. best regards, Simon ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform
Hi Johan, thanks for the extensive and motivating feedback. 2011/5/19 Johan Tibell johan.tib...@gmail.com: On Wed, May 18, 2011 at 7:32 PM, Simon Meier iridc...@gmail.com wrote: In fact, one of my current goals with this work is to polish it such that it can be integrated into the 'bytestring' library. We should definitely add a builder monoid in the bytestring package. Since Write mentions IO, I thought I should point out that we need to separate any code that mentions IO from the the code that doesn't (i.e. the pure Builder API). The use of IO is an implementation detail in bytestring. We should follow the existing bytestring pattern and put any code that mentions IO in e.g. Data.ByteString.Lazy.Builder.Internal. This allows the few people who need to access the internals to do so while making it clear that these are in fact internals. Long term we'd like to switch bytestring over from ForeignPtr to ByteArray#, if possible. There are currently some technical obstacles to such a switch, but factoring out the IO code at least makes it somewhat easier if we ever get around to switching. Avoiding IO in the main API means that the main builder type must not mention IO (or things related to IO, such as Storable). I completely agree with you. The system-io-write library [1] and the bytestring fork [2] I'm working on provide separate interfaces for standard and expert users. The naming of the system-io-write library is tentative and can be adapted once it's place is clear. The core principle used to tackle (1) is avoiding intermediate data structures. The core abstraction used is the one of a Write (see [1] for the corresponding library.) data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8)) A value `Write bound io :: Write a` denotes an encoding scheme for values of type `a` that uses at most `bound` bytes space. Given a values `x :: a` and a pointer `po` to the next free byte `io x po` encodes `x` to memory starting from `po` and returns the pointer to the next free byte after the encoding of `x`. In most cases Writes are used as an abstract datatype. They serve as an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through streams of Haskell values. Hence, typical users of Writes are functions like fromWrite :: Write a - a - Builder fromWriteList :: Write a - [a] - Builder fromWriteUnfoldr :: Write b - (a - Maybe (b, a)) - a - Builder mapWriteByteString :: Write Word8 - S.ByteString - Builder We want to allow users to efficiently create new builders, for their own data type. This is crucial as the bytestring package cannot provide efficient builders for every possible type, as it would have to depend on most of Hackage (i.e. on all packages that define types that we want efficient builders for) to do so. Allowing the user to get hold of the underlying buffer in a controlled way makes the builder extensible. This is good. Write achieves this separation, but it has some costs which I'm not entirely comfortable with. First, it leads to lots of API duplication. For every type (e.g. Word) we want to be able serialize we have two morally identical functions writeWordhost :: Word - Write fromWordhost :: Word - Builder in the API, where the latter simply calls the former and does some additional wrapping. Yes, I agree with this duplication. I'll explain below what we gain from it. Note that I factored out the whole Write stuff into its own library (system-io-write) for the bytestring integration. Therefore, an end-user of bytestring will only see the Builder versions except he's doing more low-level stuff to gain some extra performance. See http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html for examples. Simon, is the reason for this duplication this comment on top of Blaze.ByteString.Builder.Word? Note that for serializing a three tuple (x,y,z) of bytes (or other word values) you should use the expression fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z instead of fromWord8 x `mappend` fromWord8 y `mappend` fromWord z The first expression will result in a single atomic write of three bytes, while the second expression will check for each byte, if there is free space left in the output buffer. Coalescing these checks can improve performance quite a bit, as long as you use it sensibly. That's one of the reasons, but not the main one. The core reason is that Write's provide an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through (streams of) Haskell values. For simple traversals like fromWrite :: Write
[Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform
Hello Haskell-Cafe, my main question is whether requiring FlexibleInstances is a problem for code that aims to become part of the Haskell platform. The following explanation gives the context for this question. As some of you may know the blaze-builder library is now used in quite a few places. That's nice, but it doesn't mean that blaze-builder is a finished solution to the problem of providing an API for high-performance buffered output (creation of chunked representations) of sequences of bytes. In fact, one of my current goals with this work is to polish it such that it can be integrated into the 'bytestring' library. This has the benefit that functions that create lazy bytestrings (e.g., pack, map, unfoldr, filter) can be implemented such that they create well-sized chunks even if the argument bytestring is hugely fragmented. Moreover, this integration also establishes a single builder type as the output representation. Therefore, other creators of bytestrings (e.g., 'text', 'base16-bytestring', 'zlib') can provide results of type Builder, which enables O(1) appends of their results and the preservation of well-sizedness of the created chunks. As part of this goal, I'm currently working on the first of the following three points that are paramount to achieving great encoding performance: 1. Ensure that individual Haskell values are encoded with minimal overhead. 2. Ensure that concatenation of sequences of bytes is efficient. 3. Ensure that the average chunk size is large. The core principle used to tackle (1) is avoiding intermediate data structures. The core abstraction used is the one of a Write (see [1] for the corresponding library.) data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8)) A value `Write bound io :: Write a` denotes an encoding scheme for values of type `a` that uses at most `bound` bytes space. Given a values `x :: a` and a pointer `po` to the next free byte `io x po` encodes `x` to memory starting from `po` and returns the pointer to the next free byte after the encoding of `x`. In most cases Writes are used as an abstract datatype. They serve as an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through streams of Haskell values. Hence, typical users of Writes are functions like fromWrite :: Write a - a - Builder fromWriteList :: Write a - [a] - Builder fromWriteUnfoldr :: Write b - (a - Maybe (b, a)) - a - Builder mapWriteByteString :: Write Word8 - S.ByteString - Builder They consume the given datastructure efficiently and wrap the Writes in the bounds checking code to detect when a buffer is full and request a new one. There are many providers of Writes. Each bounded-length-encoding of a standard Haskell value is likely to have a corresponding Write. For example, encoding an Int32 as a big-endian, little-endian, and host-endian byte-sequence is currently achieved with the following three functions. writeInt32BE :: Write Int32 writeInt32LE :: Write Int32 writeInt32HE :: Write Int32 I would like to avoid naming all these encodings individually. Especially, as the situation becomes worse for more elaborate encodings like hexadecimal encodings. There, we encounter encodings like the utf8-encoding of the hexadecimal-encoding with lower-case letters of an Int32. writeInt32HexLowerUtf8 :: Write Int32 I really don't like that. Therefore, I'm thinking about the following solution based on type-classes. We introduce a single typeclass class Writable a where write :: Write a and use a bunch of newtypes to denote our encodings. newtype Ascii7 a = Ascii7 { unAscii7 :: a } newtype Utf8 a = Utf8 { unUtf8 :: a } newtype HexUpper a = HexUpper { unHexUpper :: a } newtype HexLower a = HexLower { unHexLower :: a } ... Assuming FlexibleInstnaces, we can write encodings like the above hex-encoding as instances instance Write (Utf8 (HexLower Int32)) where write = ... This composes rather nicely and allows the implementations to exploit special properties of the involved data. For example, if we also had a HTML escaping marker newtype Html a = Html { unHtml :: a } Then, the instance instance Write (Utf8 (HTML (HexLower Int32))) where write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i)) exploits that no HTML escaping is required for a hex-number. Assuming FlexibleContexts, the user can also build abbreviations for builders using fixed encodings. utf8 :: Writable (Utf8 a) = a - Builder utf8 = fromWrite write . Utf8 Note that, on the Builder level, a probably better way would be to have an analogous 'ToBuilder' typeclass to abstract the various encodings. Part of these instances then reuse the corresponding instances from Writable. I think this type-class based interface to select the correct efficient implementation of an