1. Decompression:
- Try https://github.com/klauspost/pgzip - it's a drop-in replacement for 
compress/gzip and the author claims it has roughly twice the decompression 
speed thanks to better buffering (lower IO wait).
- Gzip decompression is single-threaded - to use all cores, decompress 
multiple files at the same time (a sketch combining both ideas follows 
this list).
- Your storage (EBS gp2) has a max throughput of 160MB/s per volume 
(see https://aws.amazon.com/ebs/details/ ) - assuming you get max 
throughput (which is not guaranteed), just reading 5GB and writing 20GB 
will take almost 3 minutes. When downloading you only have to write 5GB, 
which is why it's faster. To get better speeds use a ram drive (hey, you 
have 122GB of RAM), an st1 EBS volume, or multiple EBS volumes in RAID0.
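
A minimal sketch combining both ideas - pgzip as a drop-in replacement for 
compress/gzip plus one goroutine per file so decompression uses more than 
one core. File names are placeholders and error handling is kept short:

package main

import (
    "io"
    "log"
    "os"
    "strings"
    "sync"

    gzip "github.com/klauspost/pgzip" // same API as compress/gzip
)

// decompress streams one .gz file to dst.
func decompress(src, dst string) error {
    in, err := os.Open(src)
    if err != nil {
        return err
    }
    defer in.Close()

    zr, err := gzip.NewReader(in)
    if err != nil {
        return err
    }
    defer zr.Close()

    out, err := os.Create(dst)
    if err != nil {
        return err
    }
    defer out.Close()

    _, err = io.Copy(out, zr)
    return err
}

func main() {
    files := []string{"part1.csv.gz", "part2.csv.gz"} // placeholder names
    var wg sync.WaitGroup
    for _, f := range files {
        wg.Add(1)
        go func(f string) {
            defer wg.Done()
            if err := decompress(f, strings.TrimSuffix(f, ".gz")); err != nil {
                log.Println(f, err)
            }
        }(f)
    }
    wg.Wait()
}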

2. Upload:
- If you're using the AWS SDK with chunked uploads then stop - it will 
first load the entire file into memory (so read + allocate...), then 
calculate a hash of the whole file (and not a "cheap" one), then sign each 
part of the file, and eventually send it over https (so + encryption...).
- Use https://docs.aws.amazon.com/sdk-for-go/api/service/s3/s3manager/ 
with multi-part uploads instead (see the sketch after this list). 
Experiment with the part size to find the sweet spot (most likely 
somewhere between the minimum 5MB and 64MB), and use a higher Concurrency 
value - the default is 5, which is often 10-50x too low, YMMV.
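
A minimal sketch of the s3manager approach - the bucket, key, file name, 
region and the tuning values below are placeholders to adjust for your 
workload:

package main

import (
    "log"
    "os"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
    // Region is a placeholder.
    sess := session.Must(session.NewSession(&aws.Config{
        Region: aws.String("us-east-1"),
    }))

    // Tune PartSize and Concurrency - these are starting points, not gospel.
    uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
        u.PartSize = 32 * 1024 * 1024 // 32MB parts
        u.Concurrency = 50            // default is 5
    })

    f, err := os.Open("output.csv") // placeholder file name
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Upload splits the file into parts and sends them concurrently.
    _, err = uploader.Upload(&s3manager.UploadInput{
        Bucket: aws.String("my-bucket"), // placeholder bucket
        Key:    aws.String("output.csv"),
        Body:   f,
    })
    if err != nil {
        log.Fatal(err)
    }
}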


On Friday, February 10, 2017 at 7:58:50 PM UTC-5, mukund....@gmail.com 
wrote:
>
> Hello,
>
> I have written a Go program which downloads a 5GB compressed CSV from 
> Amazon S3, decompresses it, and uploads the decompressed CSV (20GB) to 
> Amazon S3.
>
> Amazon S3 provides a default concurrent uploader/downloader and I am using 
> a multithreaded approach to download files in parallel, decompress and 
> upload. The program seems to work fine; however, I believe it could be 
> optimized further. Not all the cores are used even though I have 
> parallelized for the number of CPUs available. The CPU usage is only 
> around 30-40%, and I see IO wait around 30-40%. 
>
> The download is fast, the decompression takes 5-6 minutes, and the upload 
> happens in parallel but takes almost an hour for a set of 8 files. 
>
> For decompression, I use: 
> reader, err := gzip.NewReader(gzipfile)
> writer, err := os.Create(outputFile)
> _, err = io.Copy(writer, reader)
>
> I use a 16CPU, 122 GB RAM, 500 GB SSD instance
>
> Are there any other methodologies by which I can optimize the 
> decompression part and the upload part? 
>
> I am pretty new to Golang.  Any guidance is very much appreciated.
>
> Regards
> Mukund
>
