Sorry, I wasn't clear. Just meant something simpler. Compress the matrix to copy it to the GPU for faster transfers (and uncompress it appropriately on the GPU).
Barry

> On Jul 11, 2019, at 10:49 AM, Jed Brown <[email protected]> wrote:
>
> I don't know anything about zstd (or competitive compression) for GPU,
> but doubt it works at the desired granularity. I think SpMV on late-gen
> CPUs can be accelerated by zstd column index compression, especially for
> semi-structured problems, but likely also for unstructured problems
> numbered by breadth-first search or similar. But we'd need to demo that
> use specifically.
>
> "Smith, Barry F." <[email protected]> writes:
>
>> CPU to GPU? Especially matrices?
>>
>>> On Jul 11, 2019, at 9:05 AM, Jed Brown via petsc-dev
>>> <[email protected]> wrote:
>>>
>>> Zstd is a remarkably good compressor. I've experimented with it for
>>> compressing column indices for sparse matrices on structured grids and
>>> (after a simple transform: subtracting the row number) gotten
>>> decompression speed in the neighborhood of 10 GB/s (i.e., faster per
>>> core than DRAM). I've been meaning to follow up. The transformation
>>> described below (splitting the bytes) is yielding decompression speed
>>> around 1GB/s (in this link below), which isn't competitive for things
>>> like MatMult, but could be useful for things like trajectory
>>> checkpointing.
>>>
>>> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>>>
>>>
>>> From: "Radev, Martin" <[email protected]>
>>> Subject: Re: Adding a new encoding for FP data
>>> Date: July 11, 2019 at 4:55:03 AM CDT
>>> To: "[email protected]" <[email protected]>
>>> Cc: "Raoofy, Amir" <[email protected]>, "Karlstetter, Roman"
>>> <[email protected]>
>>> Reply-To: <[email protected]>
>>>
>>>
>>> Hello Liya Fan,
>>>
>>>
>>> this explains the technique but for a more complex case:
>>>
>>> https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
>>>
>>> For FP data, the approach which seemed to be the best is the following.
>>>
>>> Say we have a buffer of two 32-bit floating point values:
>>>
>>> buf = [af, bf]
>>>
>>> We interpret each FP value as a 32-bit uint and look at each individual
>>> byte. We have 8 bytes in total for this small input.
>>>
>>> buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
>>>
>>> Then we apply stream splitting and the new buffer becomes:
>>>
>>> newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
>>>
>>> We compress newbuf.
>>>
>>> Due to similarities in the sign bits, mantissa bits and MSB exponent bits, we
>>> might have a lot more repetitions in data. For scientific data, the 2nd and
>>> 3rd bytes for 32-bit data are probably largely noise. Thus in the original
>>> representation we would always have a few bytes of data which could appear
>>> somewhere else in the buffer and then a couple bytes of possible noise. In
>>> the new representation we have a long stream of data which could compress
>>> well and then a sequence of noise towards the end.
>>>
>>> This transformation improved compression ratio as can be seen in the report.
>>>
>>> It also improved speed for ZSTD. This could be because ZSTD makes a
>>> decision of how to compress the data - RLE, new huffman tree, huffman tree
>>> of the previous frame, raw representation. Each can potentially achieve a
>>> different compression ratio and compression/decompression speed. It turned
>>> out that when the transformation is applied, zstd would attempt to compress
>>> fewer frames and copy the others. This could lead to fewer attempts to build
>>> a huffman tree. It's hard to pin-point the exact reason.
>>>
>>> I did not try other lossless text compressors but I expect similar results.
>>>
>>> For code, I can polish my patches, create a Jira task and submit the
>>> patches for review.
>>>
>>>
>>> Regards,
>>>
>>> Martin
>>>
>>>
>>> ________________________________
>>> From: Fan Liya <[email protected]>
>>> Sent: Thursday, July 11, 2019 11:32:53 AM
>>> To: [email protected]
>>> Cc: Raoofy, Amir; Karlstetter, Roman
>>> Subject: Re: Adding a new encoding for FP data
>>>
>>> Hi Radev,
>>>
>>> Thanks for the information. It seems interesting.
>>> IMO, Arrow has much to do for data compression. However, it seems there are
>>> some differences between memory data compression and external storage data
>>> compression.
>>>
>>> Could you please provide some reference for stream splitting?
>>>
>>> Best,
>>> Liya Fan
>>>
>>> On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin <[email protected]> wrote:
>>>
>>>> Hello people,
>>>>
>>>>
>>>> there has been discussion in the Apache Parquet mailing list on adding a
>>>> new encoder for FP data.
>>>> The reason for this is that the compressors supported by Apache Parquet
>>>> (zstd, gzip, etc) do not compress raw FP data well.
>>>>
>>>>
>>>> In my investigation it turns out that a very simple technique,
>>>> named stream splitting, can improve the compression ratio and even speed
>>>> for some of the compressors.
>>>>
>>>> You can read about the results here:
>>>> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>>>>
>>>>
>>>> I went through the developer guide for Apache Arrow and wrote a patch to
>>>> add the new encoding and test coverage for it.
>>>>
>>>> I will polish my patch and work in parallel to extend the Apache Parquet
>>>> format for the new encoding.
>>>>
>>>>
>>>> If you have any concerns, please let me know.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Martin
>>>>
>>>
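
For reference, the stream-splitting transform described in Martin Radev's quoted message (buf = [af0, af1, af2, af3, bf0, ...] becoming newbuf = [af0, bf0, af1, bf1, ...]) can be sketched as below. This is only an illustration under stated assumptions, not the Arrow/Parquet patch discussed in the thread: the names split_streams, merge_streams and compressed_sizes are hypothetical, and zlib from the Python standard library stands in for zstd, so ratios and speeds will not match the linked report.

```python
import struct
import zlib


def split_streams(data: bytes, width: int = 4) -> bytes:
    """Gather byte k of every width-byte value into stream k, then
    concatenate the streams: [a0 a1 a2 a3 b0 b1 b2 b3] -> [a0 b0 a1 b1 ...]."""
    if len(data) % width:
        raise ValueError("buffer length must be a multiple of the value width")
    return b"".join(data[k::width] for k in range(width))


def merge_streams(data: bytes, width: int = 4) -> bytes:
    """Inverse of split_streams: re-interleave the byte streams."""
    if len(data) % width:
        raise ValueError("buffer length must be a multiple of the value width")
    n = len(data) // width  # number of values, i.e. length of each stream
    out = bytearray(len(data))
    for k in range(width):
        out[k::width] = data[k * n:(k + 1) * n]
    return bytes(out)


def compressed_sizes(values):
    """Return (plain, split) compressed sizes in bytes for 32-bit floats.

    zlib is an assumption made for portability; the thread itself is about zstd.
    """
    raw = struct.pack(f"<{len(values)}f", *values)
    return len(zlib.compress(raw)), len(zlib.compress(split_streams(raw)))
```

Calling compressed_sizes on a list of floats gives the two sizes to compare for a given data set; decompression must be followed by merge_streams to recover the original buffer. Whether the split layout actually helps depends on the data and on the compressor.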
