Re: [Haskell-cafe] lazy boxed array and builder?

2012-07-16 Thread Simon Meier
Hi Bas,

I'm not sure the unpacking is absolutely necessary. It might be worth
to give it a try with not-unpacked strict chunks. In some of my
ByteString builder experiments, I even got better performance by not
unpacking the ByteStrings in some of the intermediate data structures.
My gut feeling says that compensating the extra indirection is not too
hard provided you ensure that your chunks are large enough. The corner
case of having lots of small lazy bytestrings is likely to be slower.
Experiments will tell.

best regards,
Simon

2012/7/12 Bas van Dijk v.dijk@gmail.com:
 On 12 July 2012 15:35, Yves Parès yves.pa...@gmail.com wrote:
 I remember this discussion, lazy vectors would also enable an implementation
 of bytestring and (maybe) text only with unboxed vectors, unifying it all:
 type ByteString = Vector Word8

 Yes, I would like to add a lazy storable vector to
 vector-bytestring[1] to make the API 100% consistent with bytestring.

 Ideally we would have a type like:

 data Lazy vector a = Empty | Chuck {-# UNPACK #-} !(vector a) (Lazy vector a)

 Unfortunately GHC can't unpack polymorphic fields. The next best thing
 is to use a type family which for each type of vector would return its
 lazy version (where the vector is unpacked in the cons cell). Then we
 would need a class for common operations on those lazy vectors.

 Regards,

 Bas

 [1] http://hackage.haskell.org/package/vector-bytestring
 https://github.com/basvandijk/vector-bytestring

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Hackage 2 maintainership

2012-03-04 Thread Simon Meier
Hi Duncan,

I just wanted to thank you and all the other guys pushing Hackage 2
towards a public release. I just tested the

http://hackage.factisresearch.com/

instance and it's blazingly fast. Cool stuff! The reverse dependencies
are also very useful. I know that sending patches instead of thanks
would help more. I'll try that in a (hopefully) not too distant future
:-)

best regards and thanks again,
Simon


2012/2/14 Duncan Coutts duncan.cou...@googlemail.com:
 Hi Ben,

 On 13 February 2012 23:44, Ben Gamari bgamari.f...@gmail.com wrote:
 Hey all,

 Those of you who follow the Haskell subreddit no doubt saw today's post
 regarding the status of Hackage 2. As has been said many times in the
 past, the primary blocker at this point to the adoption of Hackage 2
 appears to be the lack of an administrator.

 Yes, much of it is lack of an individual to keep momentum up and keep
 everyone else motivated. While I'm keen that hackage moves forward, my
 volunteer time is spread too thin to be that person keeping everything
 organised. That said, where I spend my volunteer time is to a large
 part directed by what other people are doing, it's much more fun and
 motivating if there's other people working with you.

 Speaking of which, I spent much of this evening fixing things, more
 details below.

 It seems to me this is a poor reason for this effort to be held
 up. Having taken a bit of time to consider, I would be willing to put in
 some effort to get things moving and would be willing to maintain the
 haskell.org Hackage 2.0 instance going forward if necessary.

 That would be great. So in the short term I'm very happy to get the
 help and in the longer term I'm happy to hand over to anyone sensible
 who puts in the effort. That person could be you, someone else or a
 team of several people.

 More immediately, my general policy with commit access is to give it
 to anyone who's sent a few good patches. Currently there are 7 people
 with write access to the darcs repo on code.h.o. It is of course also
 fine for people to maintain their own public branches (which they can
 do using git or darcs, whichever).

 I currently have a running installation on my personal machine and
 things seem to be working as they should. On the whole, installation was
 quite trivial, so it seems likely that the project is indeed at a point
 where it can take real use (although a logout option in the web
 interface would make testing a bit easier).

 Yes, we're at the stage where we can run a public testing instance.
 You'll see there's a bit more to implement and test for a switchover.

 That being said, it would in my opinion be silly to proceed without
 fixing the Hackage trac. It was taken down earlier this year due to
 spamming[1] and it seems the recovery project has been orphaned. I would
 be willing to help with this effort, but it seems like the someone more
 familiar with the haskel.org infrastructure might be better equipped to
 handle the situation.

 I spent a couple hours on this this evening and I've finally fixed it
 (I hope). I still need to purge a bit of wiki/ticket spam (help
 apreciated there). Sadly I've had to blow away the previos login
 accounts, but I've semi-restored by copying the ghc trac accounts. So
 if you happen to have an account on the ghc trac, then your login
 should work for the hackage trac. Otherwise you'll need to re-register
 as if it was a new account.

 It seems that this process will go something like this,
  1) Bring Hackage trac back from the dead

 Check.

  2) Bring up a Hackage 2 instance along-side the existing
     hackage.haskell.org

 Yes, now that the trac is back, you can see what notes we have on the
 switchover process at:
 http://hackage.haskell.org/trac/hackage/wiki/HackageDB/2.0

 Note also that the nice people at factisresearch.com have given us a
 VM with enough memory (8GB) for the purpose of running a public test
 with the full package set (in principle it should not need so much
 memory, but we currently keep unnecessary package metadata in memory).

 So thanks to you and others this evening motivating me, I've also
 taken Max's latest patches to my tar package (which coincidentally I
 released yesterday) and the corresponding hackage-server patch and set
 it running at:

 http://hackage.factisresearch.com/

 This is running the latest upstream darcs version. I have also fired
 off a one-shot mirroring operation. This will mirror all the existing
 packages from hackage. It'll probably take half a day or so to
 complete since there's something like 30-40k tarballs to be copied
 over. I'll check the logs tomorrow hopefully and after that kick off a
 live/continuous mirror so it'll get new updates from the main hackage
 within 20-30 min or so.

 Last time Max and I tried this we were able to mirror almost all
 packages. Most of the unmirrorable ones at the time were due to
 packages with quirks in their tar format, which is what his tar
 patches were aimed at. So I'm 

Re: [Haskell-cafe] List x ByteString x Lazy Bytestring

2011-12-06 Thread Simon Meier
Hi John,

  I've used Haskell and GHC to solve particular real life application. 4
    tools were developed and their function is almost the same - they
    modify textual input according to patterns found in the text. Thus, it

 Hmm, modification can be a problem for ByteStrings, since it entails
 copying. That could be worse for strict BytStrings than lazy, if in the
 lazy ByteString you can reuse many chunks.

 I understand now, that is probably the point.



 Two main possibilities:
 1. your algorithm isn't suited for ByteStrings
 2. you're doing it wrong

 The above indicates 1., but without a more detailed description and/or
 code, it's impossible to tell.


 Yes, it seems that the (1) is the point, because I split and re-build
 the bytestream many times during processing.

For splitting it might be interesting to have a look at 'attoparsec'
(http://hackage.haskell.org/package/attoparsec), which are parser
combinators specialized to bytestring's. If the splitting still leaves
large enough chunks of the original input intact (large enough being
around  128 bytes), then you might be able to achieve a benefit.

To use strict and lazy bytestrings efficiently, it helps a lot to know
their internal representation (a slice of a char[] array and lists of
slices of char[] arrays) and to think about what the different
operations cost on this datastructure. (The source code of the library
is actually not that hard to understand.) An obviously very expensive
operation is concatenation (guaranteed copying for strict bytestrings
at least a list traversal for lazy bytestrings). The blaze-builder
library provides a type that supports O(1) concatenation of sequences
of bytes and an efficient conversion to a lazy bytestring that has a
large average chunk size. Large chunks are important to amortize the
work spent on the boundary with the speedup gained due to the compact
and cache efficient representation.

If you keep using lists, but need a lot of concatenation using
difference lists (http://hackage.haskell.org/package/dlist) also
allows a O(1) concatenation. Note that the cost of having this O(1)
concatenation is that the resulting list cannot be inspected, as long
as it is in a form that supports O(1) concatenation. The same holds
for the lazy bytestring builder provided by blaze-builder.

Note that I'd be very curious, if you could achieve any further
performance improvement using blaze-builder...which is no surprise, as
I'm its author :-) Note also that the API has been cleaned up and will
be published with the next release of the bytestring library. So just
ignore the whole 'write' stuff. I should have separated it.

good luck and best regards,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Attoparsec concatenating combinator

2011-06-07 Thread Simon Meier
2011/6/6 Bryan O'Sullivan b...@serpentine.com:
 On Sun, Jun 5, 2011 at 11:00 AM, Yitzchak Gale g...@sefer.org wrote:

 If behind the scenes the concat is copying directly from slices of the
 original
 input, then no, in principle we're not saving much then.
 I thought there were *two* copies going on.

 If you're using the specialised functions like attoparsec's takeWhile, then
 all they do is return a view into the underlying array. No copying occurs
 until the concat itself. Now that I think of it: in principle, you could
 write a specialised concat that would check the pointer/offset/length
 combinations of its arguments and, if they all abutted perfectly, would just
 return a new view into that same array, sans copying. (You'd have to hide it
 behind unsafePerformIO, of course.)

Why would you need 'unsafePerformIO'. You can scrutinise the 'PS'
constructors of the slice without dropping down to IO. The required
'Eq' instance on 'ForeignPtr' is also present.

Using a Builder for concatentation makes sense, if you want to exploit
that copying a slice of the input array is cheaper right after it has
been inspected (its fully cached) than later (as it is done when
collecting slices in a list). However, you can only have one Builder
at a time and some low-level meddling is probably required to
interleave the feeding of the Parser with input arrays with the
feeding of the Builder with free buffers. Nevertheless, for something
like parsing Chunked HTTP content it would make a lot of sense. I'm
inclined look into that once I finished porting the blaze Builder to
the 'bytestring' library.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Attoparsec concatenating combinator

2011-06-07 Thread Simon Meier
2011/6/7 Bryan O'Sullivan b...@serpentine.com:
 On Tue, Jun 7, 2011 at 1:40 AM, Simon Meier iridc...@gmail.com wrote:

 Why would you need 'unsafePerformIO'. You can scrutinise the 'PS'
 constructors of the slice without dropping down to IO.

 True. Oops :-)


 Using a Builder for concatentation makes sense, if you want to exploit
 that copying a slice of the input array is cheaper right after it has
 been inspected (its fully cached) than later (as it is done when
 collecting slices in a list).

 When I've measured this in the past, I've found that it's often faster to
 accumulate a list and then run concat at the end than to use blaze-builder
 directly. That was certainly the case wit GHC 6.12; I haven't remeasured
 with 7.0. That's why you'll see that some places in the aeson JSON library
 use blaze-builder, while others manipulate bytestrings directly.

When creating a Builder that you run afterwards, then you essentially
create a list of bytestring concatenations as a closure. It makes
sense that you don't win with such an approach, as the concatenation
still only happens after then end of this list is reached. What you'd
need is to nest attoparsec's Parser monad with the 'Put' monad
provided in the blaze-builder internals [1]. However, I'm not yet sure
how to achieve this in a modular fashion. Perhaps, using iteratee's
the right way might provide a good answer, but that's just guesswork.

The performance characteristics of blaze-builder are such that for
short output sequences working with bytestrings directly is sometimes
favorable, as the setup overhead before the Builder can be executed
needs to be amortized. However, Builders work with a single pass
through the input data, while many bytestring operations require two
passes. For example, 'Data.ByteString.pack' first determines the
length of the input and only afterwards writes the data to memory.
Using blaze-builder's 'fromWord8s' requires only a single traversal
and is faster for lists of length 64 on my 32-bit Core 2 Duo machine.
I expect a similar effect for concatenating lists of strict
bytestrings.

[1] 
http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Internal.html#g:3

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Efficient object identity (aka symbols as data)

2011-05-26 Thread Simon Meier
2011/5/26 Jacek Generowicz jacek.generow...@cern.ch:

 On 2011 May 26, at 11:16, Christopher Done wrote:

 On 26 May 2011 10:45, Jacek Generowicz jacek.generow...@cern.ch wrote:

 What is the Haskell approach to efficient comparison and lookup of
 objects
 by their identity?

 Often you just provide your own and implement Eq.

 I should be able to run the program on data that becomes available at run
 time.

 Typically you define an id generator and zip anything coming from the
 input stream up with that generator.

 Makes sense.

 Whatever algorithm I choose to use for the optimization, will have to do
 lots of comparisons of Groups and Persons where their *identity* is all
 that
 matters: you don't need to look inside the objects.

 To achieve this abstraction the usual way is just implementing Eq:

 instance Eq Person where
  Person{personId=id1} == Person{personId=id2} = id1 == id2

 Any comments on the relative efficiency of the above as compared to

 A == B in the context of

 data Foo = A | B | C | D | ... lots more ...

 ?

 (I imagine that a Sufficiently Smart Compiler could reduce (==) :: Person
 Person to just integer comparison.)

I'm pretty sure GHC will do that for you.

An approach similar to the one by Chris Done is to use a
bi-directional map between Persons and Ints along the lines of

data IdMap a = IdMap (IntMap a) (Map a Int)

You can then associate a unique Int with each of your Persons and use
this during your comparison. For associating Ints to the Persons, a
simple fold or a State monad computation suffice. For the lookups on
the further properties of Persons, an additional argument or the
Reader monad will do. If you use a State monad and a single operation
that associates an Int to a Person, then you additionally get the
guarantee (inside a monadic computation) that no two Persons will be
associated with the same Int.

Efficiency-wise, you'll have O(log(n)) association,  O(min(n,W))
access time, and O(1) comparison time with a very low constant factor.
See the IntMap documentation for the O(min(n,W)) explanation.
Additionally, the code is pure with all the nice properties that come
with it.

By the way this problem is very similar to the one of observable
sharing. See this thread:
http://www.haskell.org/pipermail/haskell-cafe/2008-February/039639.html

best regards,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Policy for taking over a package on Hackage

2011-05-25 Thread Simon Meier
2011/5/25 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com:
 On 25 May 2011 22:17, Stephen Tetley stephen.tet...@gmail.com wrote:
 Hi Ivan

 Forks are good, no?

 The Parsec experience has suggested to me at least, that new author's
 capping another author's work by bumping up to a major version,
 causes a significant difficulties even when the original author has
 gone.

 As for wl-pprint, it was a very tidy library in its original
 implementation - it's a pity it now has name clashes with Applicative.
 My feeling is that a new library in a new namespace with some
 attention to new combinator names would be better.

 Such as?  I'm _hopeless_ at making up names... ;-)

 Having a new package would require a new name and new module
 namespace, let alone thinking up new names for combinators...

 Also, by clashes with Applicative, are you referring to empty and $
 ?  I'm not sure if a better name than empty can be found; as for
 $, maybe using pretty's notation of $$ and $+$ rather than $ and
 $$ ?

What about 'emptyDoc'? Moreover, if you are changing the names of
combinators, then moving them away from Applicative and Arrow would be
a good idea; i.e., don't use +, as it already used by ArrowPlus.
Moreover, if you can make a Monoid instance such that `mappend` equals
, you would also make the library compatible to a future
introduction of () = mappend.

best regards,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-20 Thread Simon Meier
2011/5/19 Antoine Latter aslat...@gmail.com:
 On Thu, May 19, 2011 at 3:06 PM, Simon Meier iridc...@gmail.com wrote:

 The core problem that drove me towards this solution is the abundance
 of different IntX and WordX types. Each of them requiring a separate
 Write for big-endian, little-endian, host-endian, lower-case-hex, and
 uper-case-hex encodings; i.e., currently, there are

 int8BE   :: Write Int8
 int16BE :: Write Int16
 int32BE :: Write Int32
 ...
 hexLowerInt8 :: Write Int8
 ...

 and so on. As you can see
 (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html)
 this approach clutters the public API quite a bit. Hence, I'm thinking
 of using a separate type-class for each encoding; i.e.,


 If Johan's work on Data.Binary and rewrite rules works out, then it
 would cut the exposed API in half, which helps.

 We could then use the module and package system to further keep the
 API clean, with builders which output a specific encoding could live
 in separate modules. This could also keep the names of the functions
 short, as well.

 That would require coming up with logical divisions for the functions
 you're creating, and I don't understand the big picture enough to help
 with that.

  class BigEndian a where
    bigEndian :: Write a

 This collapses the big-endian encodings of all 10 bounded-size (signed
 and unsigned) integer types under a single name with a well-defined
 semantics. Moreover, it's standard Haskell 98. For the hex-encodings,
 I'm thinking about providing type-classes

  class HexLower a where
    hexLower :: Write a

  class HexLowerNoLead a where
    hexLowerNoLead :: Write a

  ...

 for ASCII encoding and each of the standard Unicode encodings in a
 separate module. The user can then select the right ones using
 qualified imports. In most cases, he won't even need qualification, as
 mixing different character encodings is seldomly used.


 I think we may be at cross-purposes here, and might not even be
 discussing the same thing - I would imagine that any sort of 'Builder'
 type included in the bytestring package would only provide the core
 combinators for packing data into low-level binary formats, so
 discussions about text encoding issues, converting to hexidecimal and
 Html escaping are going above my head.

 This seems like what the 'text' package was written for - to separate
 out the construction of textual data from choosing its encoding.

 Are there use-cases where the 'text' package is too slow for this sort
 of approach?

 Take care,
 Antoine

 What do you think about such an interface? Is there another catch
 hidden, I'm not seeing? BTW, note that Writes are a pure compile time
 abstraction and are thought to be completely inlined. In typical, uses
 cases there's no efficiency overhead stemming from these typeclasses.

 best regards,
 Simon



Yes, for example using the current 'text' package is sup-optimal for
dyamically generating UTF-8 encoded HTML pages. The job is simple: the
data which is originally held in standard Haskell types (e.g., String)
needs to be HTML escaped and UTF-8 encoded and sprinkled with tags in
between.

For blaze-html using blaze-builder the cost for a tag is a memcpy of
the corresponding tag and the cost for a single character is one call
to the nested case statement determining if the char needs to be
escaped (one memcpy of its escaped version) or what bytes need to be
written for UTF-8 encoding the char. This solution works with a single
output buffer.

For a solution using the text library the cost of creating the
underlying UTF-16 array is similar to the cost for blaze-builder.
However, you now also need to UTF-8 encode the UTF-16 array. This
costs you more than double, as now you also have to inspect every
character of every tag. For ~50% of your data you suddenly have to
spend a lot more effort!

I agree that the text library is a good choice for representing
Unicode data of an application. However, for high-performance
applications it pays off to think of its output in binary form and
exploit the offered shortcuts. That's where blaze-builder and the like
come in.

thanks for your input,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-20 Thread Simon Meier
2011/5/20 Bas van Dijk v.dijk@gmail.com:
 On 19 May 2011 10:53, Johan Tibell johan.tib...@gmail.com wrote:
 Long term we'd like to switch bytestring over
 from ForeignPtr to ByteArray#, if possible. There are currently some
 technical obstacles to such a switch

 BTW I'm working with Roman Leshchinskiy to create the
 vector-bytestring package which provides:
 type ByteString = Data.Vector.Storable.Vector Word8
 and exports the same API as the bytestring package (no support for
 lazy bytestrings yet)

 A storable vector still uses a ForeignPtr but maybe this will make the
 switch to unboxed Vectors (which use ByteArray#) easier.

 Expect some code to be up somewhere next week.

 Bas


OK, that sounds interesting. I'm looking forward to your code.

One trick, I'm using in the blaze-builder implementation is that the
current buffer is denoted by a region (pf :: Ptr Word8, pe :: Ptr
Word8) of the next free byte `pf` and the first byte after the buffer
`pe`. This only works for pinned ByteArrays/ForeignPtrs. Do you know
what the cost of such an array/foreignptr is? Moreover, after creation
I could unpin the array. Do you perhaps know if thats possible in
principle?

best regards,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-20 Thread Simon Meier
2011/5/20 Henning Thielemann schlepp...@henning-thielemann.de:
 Simon Meier schrieb:

 There are many providers of Writes. Each bounded-length-encoding of a
 standard Haskell value is likely to have a corresponding Write. For
 example, encoding an Int32 as a big-endian, little-endian, and
 host-endian byte-sequence is currently achieved with the following
 three functions.

   writeInt32BE :: Write Int32
   writeInt32LE :: Write Int32
   writeInt32HE :: Write Int32

 I would like to avoid naming all these encodings individually.

 Maybe this one helps:
 http://hackage.haskell.org/packages/archive/storable-endian/0.2.4/doc/html/Data-Storable-Endian.html
 ?


Thanks. I didn't know of that package. It for sure provides some
valuable input. Moreover, it pointed me to Antoine Latter's byteorder
package (http://hackage.haskell.org/package/byteorder). I might use
it, although I would rather have the ByteOrder determined at compile
time. Shouldn't GHC provide some means to determine this, as it's
compiling for a fixed architecture anyway? What experience have you
made Antoine when you implemented this package?

best regards,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-20 Thread Simon Meier
2011/5/20 Johan Tibell johan.tib...@gmail.com:
 Hi Simon,

 On Thu, May 19, 2011 at 10:46 PM, Simon Meier iridc...@gmail.com wrote:
 Write achieves this separation, but it has some costs which I'm not
 entirely comfortable with.

 First, it leads to lots of API duplication. For every type (e.g. Word)
 we want to be able serialize we have two morally identical functions

    writeWordhost :: Word - Write
    fromWordhost :: Word - Builder

 in the API, where the latter simply calls the former and does some
 additional wrapping.

 Yes, I agree with this duplication. I'll explain below what we gain
 from it. Note that I factored out the whole Write stuff into its own
 library (system-io-write) for the bytestring integration. Therefore,
 an end-user of bytestring will only see the Builder versions except
 he's doing more low-level stuff to gain some extra performance.

 There are (at least) two cases where I think the simple Builder API
 must perform well for it to be usable on its own: simple loops and
 sequential writes. To be specific, if the following two cases don't
 compile into near optimal code, there's a compiler bug we should fix.
 First, a simple loop:

    f :: [Word8] - Builder
    f [] = mempty
    f (x:xs) = singleton x `mappend` xs

 This code is already quite low level, there should be enough
 information here for the compiler to emit a simple loop with one
 buffer bounds check per iteration. Second, a bunch of sequential
 writes:

    g :: Word8 - Word8 - Word8 - Word8 - Builder
    g a b c d = singleton `mappend` (b `mappend` (c `mappend` d))

 This ought to compile to a single bounds check followed by for memory writes.

 The user shouldn't have to get more low-level than this in these
 simple examples. Today this is currently only true for the second
 example, which we can solve using rewrite rules. The first example
 doesn't work due to the GHC compiler bug I mentioned.

I agree with you that simple uses of the Builder API should be
optimized well. I think we can also guide the user indirectly by
making writing the efficient code even easier than writing the
possibly dangerous one. I'm thinking about providing type-classes for
standard encodings. For example,

class Utf8 a where
utf8 :: a - Builder

instance Utf8 String where
utf8 = fromWriteList writeCharUtf8

-- further instances: Char, Text

This yields another way of navigating around difficult optimization territory.

 Simon, is the reason for this duplication this comment on top of
 Blaze.ByteString.Builder.Word?

    snip

 That's one of the reasons, but not the main one. The core reason is
 that Write's provide
 an interface between implementors of the low-level bit-twiddling
 required to efficiently implement encodings like UTF-8 or Base16 and
 the providers of efficient traversal functions through (streams of)
 Haskell values. For simple traversals like

   fromWrite          :: Write a - a - Builder
   fromWriteList      :: Write a - [a] - Builder
   fromWriteUnfoldr   :: Write b - (a - Maybe (b, a)) - a - Builder

 there might be the option that GHC is clever enough and can find the
 efficient loop. However, for more complicated functions like

   mapWriteByteString :: Write Word8 - S.ByteString - Builder

 That certainly isn't the case. I'm using quite a few tricks there [3]
 to enable a tight inner loop with few live variables.

 Right. So this argues for having an escape hatch, and I agree we
 should have one. Write at writeAtMost are both such escape hatches and
 I believe them to equal in expressiveness. This shouldn't come as a
 surprise as Write is writeAtMost with one argument reified into into a
 constructor field:

    writeAtMost :: Int - (Ptr Word8 - IO Int) - IO ()
    data Write = {-# UNPACK #-} !Int (Ptr Word8 - IO (Ptr Word8))

 (That the second argument of writeAtMost is an Int instead of a Ptr
 Word8 as in Write is an unimportant difference.)

There, seems to be a historical artefact here. The new Write
abstraction in system-io-write is different from the one used in
blaze-builder. It's type is

  data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8))

This definition ensures that the bound on the number of bytes written
is independent of the value being encoded. That's crucial for the
implementation of `mapWriteByteString`. It also benefits the other
Write combinators, as the bound can always be computed in a
data-independent fashion. Inlining, is therefore really sufficient to
arrive at a constant bound during compile time.

I don't see how this Write type can be emulated using `writeAtMost`, do you?

 There are some operational differences.

 * The argument to Write can be inspected at runtime, while the
 argument to writeAtMost can only be inspected at compile time (by a
 rewrite rule).

 * Write might exist at runtime, if it's allocation site cannot be seen
 by its use site, which hard to guarantee in general (it requires
 serious staring at Core). This is not the case for writeAtMost, unless

Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Simon Meier
Hi Antoine, thanks for your feedback.

2011/5/18 Antoine Latter aslat...@gmail.com:
 On Wed, May 18, 2011 at 12:32 PM, Simon Meier iridc...@gmail.com wrote:
 Hello Haskell-Cafe,





 There are many providers of Writes. Each bounded-length-encoding of a
 standard Haskell value is likely to have a corresponding Write. For
 example, encoding an Int32 as a big-endian, little-endian, and
 host-endian byte-sequence is currently achieved with the following
 three functions.

  writeInt32BE :: Write Int32
  writeInt32LE :: Write Int32
  writeInt32HE :: Write Int32

 I would like to avoid naming all these encodings individually.
 Especially, as the situation becomes worse for more elaborate
 encodings like hexadecimal encodings. There, we encounter encodings
 like the utf8-encoding of the hexadecimal-encoding with lower-case
 letters of an Int32.

  writeInt32HexLowerUtf8 :: Write Int32

 I really don't like that. Therefore, I'm thinking about the following
 solution based on type-classes. We introduce a single typeclass

  class Writable a where
      write :: Write a

 and use a bunch of newtypes to denote our encodings.

  newtype Ascii7   a = Ascii7   { unAscii7   :: a }
  newtype Utf8     a = Utf8     { unUtf8     :: a }
  newtype HexUpper a = HexUpper { unHexUpper :: a }
  newtype HexLower a = HexLower { unHexLower :: a }
  ...

 Assuming FlexibleInstnaces, we can write encodings like the above
 hex-encoding as instances

  instance Write (Utf8 (HexLower Int32)) where
    write = ...

 This composes rather nicely and allows the implementations to exploit
 special properties of the involved data. For example, if we also had a
 HTML escaping marker

  newtype Html     a = Html     { unHtml     :: a }

 Then, the instance

  instance Write (Utf8 (HTML (HexLower Int32))) where
    write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i))

 If I were authoring the above code, I don't see why that code is any
 easier to write or easier to read than:

 urf8HtmlHexLower i = utf8HexLower i

 And if I were using the encoding functions, I would much prefer to see:

 urf8HtmlHexLower magicNumber

 In my code, instead of:

 write $ Utf8 $ HTML $ HexLower magicNumber

 In addition, this would be difficult for me as a developer using the
 proposed library, because I would have no way to know which
 combinations of newtypes are valid from reading the haddocks.

 Maybe I'm missing something fundamental, but this approach seems more
 cumbersome to me as a library author (more boilerplate) and as the
 user of the library (less clarity in the docs and in the resultant
 code).

Hmm, that's a valid point you raise here. Especially, the
documentation issue bothers me.

The core problem that drove me towards this solution is the abundance
of different IntX and WordX types. Each of them requiring a separate
Write for big-endian, little-endian, host-endian, lower-case-hex, and
uper-case-hex encodings; i.e., currently, there are

int8BE   :: Write Int8
int16BE :: Write Int16
int32BE :: Write Int32
...
hexLowerInt8 :: Write Int8
...

and so on. As you can see
(http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html)
this approach clutters the public API quite a bit. Hence, I'm thinking
of using a separate type-class for each encoding; i.e.,

  class BigEndian a where
bigEndian :: Write a

This collapses the big-endian encodings of all 10 bounded-size (signed
and unsigned) integer types under a single name with a well-defined
semantics. Moreover, it's standard Haskell 98. For the hex-encodings,
I'm thinking about providing type-classes

  class HexLower a where
hexLower :: Write a

  class HexLowerNoLead a where
hexLowerNoLead :: Write a

  ...

for ASCII encoding and each of the standard Unicode encodings in a
separate module. The user can then select the right ones using
qualified imports. In most cases, he won't even need qualification, as
mixing different character encodings is seldomly used.

What do you think about such an interface? Is there another catch
hidden, I'm not seeing? BTW, note that Writes are a pure compile time
abstraction and are thought to be completely inlined. In typical, uses
cases there's no efficiency overhead stemming from these typeclasses.

best regards,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Simon Meier
Hi Johan,

thanks for the extensive and motivating feedback.

2011/5/19 Johan Tibell johan.tib...@gmail.com:
 On Wed, May 18, 2011 at 7:32 PM, Simon Meier iridc...@gmail.com wrote:
 In fact, one of my current goals with this work is to polish it such
 that it can be integrated into the 'bytestring' library.

 We should definitely add a builder monoid in the bytestring package.

 Since Write mentions IO, I thought I should point out that we need to
 separate any code that mentions IO from the the code that doesn't
 (i.e. the pure Builder API). The use of IO is an implementation detail
 in bytestring. We should follow the existing bytestring pattern and
 put any code that mentions IO in e.g.
 Data.ByteString.Lazy.Builder.Internal. This allows the few people who
 need to access the internals to do so while making it clear that these
 are in fact internals. Long term we'd like to switch bytestring over
 from ForeignPtr to ByteArray#, if possible. There are currently some
 technical obstacles to such a switch, but factoring out the IO code at
 least makes it somewhat easier if we ever get around to switching.

 Avoiding IO in the main API means that the main builder type must not
 mention IO (or things related to IO, such as Storable).

I completely agree with you. The system-io-write library [1] and the
bytestring fork [2] I'm working on provide separate interfaces for
standard and expert users. The naming of the system-io-write library
is tentative and can be adapted once it's place is clear.

 The core principle used to tackle (1) is avoiding intermediate data
 structures.  The core abstraction used is the one of a Write (see [1]
 for the corresponding library.)

  data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8))

 A value `Write bound io :: Write a` denotes an encoding scheme for
 values of type `a` that uses at most `bound` bytes space. Given a
 values `x :: a` and a pointer `po` to the next free byte `io x po`
 encodes `x` to memory starting from `po` and returns the pointer to
 the next free byte after the encoding of `x`.

 In most cases Writes are used as an abstract datatype. They serve as
 an interface between implementors of the low-level bit-twiddling
 required to efficiently implement encodings like UTF-8 or Base16 and
 the providers of efficient traversal functions through streams of
 Haskell values. Hence, typical users of Writes are functions like

  fromWrite          :: Write a - a - Builder
  fromWriteList      :: Write a - [a] - Builder
  fromWriteUnfoldr   :: Write b - (a - Maybe (b, a)) - a - Builder
  mapWriteByteString :: Write Word8 - S.ByteString - Builder

 We want to allow users to efficiently create new builders, for their
 own data type. This is crucial as the bytestring package cannot
 provide efficient builders for every possible type, as it would have
 to depend on most of Hackage (i.e. on all packages that define types
 that we want efficient builders for) to do so. Allowing the user to
 get hold of the underlying buffer in a controlled way makes the
 builder extensible. This is good.

 Write achieves this separation, but it has some costs which I'm not
 entirely comfortable with.

 First, it leads to lots of API duplication. For every type (e.g. Word)
 we want to be able serialize we have two morally identical functions

    writeWordhost :: Word - Write
    fromWordhost :: Word - Builder

 in the API, where the latter simply calls the former and does some
 additional wrapping.

Yes, I agree with this duplication. I'll explain below what we gain
from it. Note that I factored out the whole Write stuff into its own
library (system-io-write) for the bytestring integration. Therefore,
an end-user of bytestring will only see the Builder versions except
he's doing more low-level stuff to gain some extra performance.

 See 
 http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html
 for examples.

 Simon, is the reason for this duplication this comment on top of
 Blaze.ByteString.Builder.Word?

    Note that for serializing a three tuple (x,y,z) of bytes (or other
    word values) you should use the expression

        fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z

    instead of

        fromWord8 x `mappend` fromWord8 y `mappend` fromWord z

    The first expression will result in a single atomic write of three
    bytes, while the second expression will check for each byte, if
    there is free space left in the output buffer. Coalescing these
    checks can improve performance quite a bit, as long as you use it
    sensibly.

That's one of the reasons, but not the main one. The core reason is
that Write's provide
an interface between implementors of the low-level bit-twiddling
required to efficiently implement encodings like UTF-8 or Base16 and
the providers of efficient traversal functions through (streams of)
Haskell values. For simple traversals like

  fromWrite          :: Write

[Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-18 Thread Simon Meier
Hello Haskell-Cafe,

my main question is whether requiring FlexibleInstances is a problem
for code that aims to become part of the Haskell platform. The
following explanation gives the context for this question.

As some of you may know the blaze-builder library is now used in quite
a few places. That's nice, but it doesn't mean that blaze-builder is a
finished solution to the problem of providing an API for
high-performance buffered output (creation of chunked representations)
of sequences of bytes.

In fact, one of my current goals with this work is to polish it such
that it can be integrated into the 'bytestring' library. This has the
benefit that functions that create lazy bytestrings (e.g., pack, map,
unfoldr, filter) can be implemented such that they create well-sized
chunks even if the argument bytestring is hugely fragmented. Moreover,
this integration also establishes a single builder type as the output
representation. Therefore, other creators of bytestrings (e.g.,
'text', 'base16-bytestring', 'zlib') can provide results of type
Builder, which enables O(1) appends of their results and the
preservation of well-sizedness of the created chunks.

As part of this goal, I'm currently working on the first of the
following three points that are paramount to achieving great encoding
performance:

  1. Ensure that individual Haskell values are encoded with minimal overhead.
  2. Ensure that concatenation of sequences of bytes is efficient.
  3. Ensure that the average chunk size is large.

The core principle used to tackle (1) is avoiding intermediate data
structures.  The core abstraction used is the one of a Write (see [1]
for the corresponding library.)

  data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8))

A value `Write bound io :: Write a` denotes an encoding scheme for
values of type `a` that uses at most `bound` bytes space. Given a
values `x :: a` and a pointer `po` to the next free byte `io x po`
encodes `x` to memory starting from `po` and returns the pointer to
the next free byte after the encoding of `x`.

In most cases Writes are used as an abstract datatype. They serve as
an interface between implementors of the low-level bit-twiddling
required to efficiently implement encodings like UTF-8 or Base16 and
the providers of efficient traversal functions through streams of
Haskell values. Hence, typical users of Writes are functions like

  fromWrite  :: Write a - a - Builder
  fromWriteList  :: Write a - [a] - Builder
  fromWriteUnfoldr   :: Write b - (a - Maybe (b, a)) - a - Builder
  mapWriteByteString :: Write Word8 - S.ByteString - Builder

They consume the given datastructure efficiently and wrap the Writes
in the bounds checking code to detect when a buffer is full and
request a new one.

There are many providers of Writes. Each bounded-length-encoding of a
standard Haskell value is likely to have a corresponding Write. For
example, encoding an Int32 as a big-endian, little-endian, and
host-endian byte-sequence is currently achieved with the following
three functions.

  writeInt32BE :: Write Int32
  writeInt32LE :: Write Int32
  writeInt32HE :: Write Int32

I would like to avoid naming all these encodings individually.
Especially, as the situation becomes worse for more elaborate
encodings like hexadecimal encodings. There, we encounter encodings
like the utf8-encoding of the hexadecimal-encoding with lower-case
letters of an Int32.

  writeInt32HexLowerUtf8 :: Write Int32

I really don't like that. Therefore, I'm thinking about the following
solution based on type-classes. We introduce a single typeclass

  class Writable a where
  write :: Write a

and use a bunch of newtypes to denote our encodings.

  newtype Ascii7   a = Ascii7   { unAscii7   :: a }
  newtype Utf8 a = Utf8 { unUtf8 :: a }
  newtype HexUpper a = HexUpper { unHexUpper :: a }
  newtype HexLower a = HexLower { unHexLower :: a }
  ...

Assuming FlexibleInstnaces, we can write encodings like the above
hex-encoding as instances

  instance Write (Utf8 (HexLower Int32)) where
write = ...

This composes rather nicely and allows the implementations to exploit
special properties of the involved data. For example, if we also had a
HTML escaping marker

  newtype Html a = Html { unHtml :: a }

Then, the instance

  instance Write (Utf8 (HTML (HexLower Int32))) where
write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i))

exploits that no HTML escaping is required for a hex-number.  Assuming
FlexibleContexts, the user can also build abbreviations for builders
using fixed encodings.

  utf8 :: Writable (Utf8 a) = a - Builder
  utf8 = fromWrite write . Utf8

Note that, on the Builder level, a probably better way would be to
have an analogous 'ToBuilder' typeclass to abstract the various
encodings. Part of these instances then reuse the corresponding
instances from Writable.

I think this type-class based interface to select the correct
efficient implementation of an