I remember I encountered an OOM when using spark broadcast when AE is enabled. which seems like the same issue here. The AE takes the compressed data into consideration but decompress the data when broadcasting and thus throws OOM.
On Fri, Aug 2, 2019 at 11:55 PM Wes McKinney <[email protected]> wrote: > > I've been working (with Hatem Helal's assistance!) the last few months > to put the pieces in place to enable reading BYTE_ARRAY columns in > Parquet files directly to Arrow DictionaryArray. As context, it's not > uncommon for a Parquet file to occupy ~100x less (even greater > compression factor) space on disk than fully-decoded in memory when > there are a lot of common strings. Users get frustrated sometimes when > they read a "small" Parquet file and have memory use problems. > > I made a benchmark to exhibit an example "worst case scenario" > > https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7 > > In this example, we have a table with a single column containing 10 > million values drawn from a dictionary of 1000 values that's about 50 > kilobytes in size. Written to Parquet, the file a little over 1 > megabyte due to Parquet's layers of compression. But read naively to > Arrow BinaryArray, about 500MB of memory is taken up (10M values * 54 > bytes per value). With the new decoding machinery, we can skip the > dense decoding of the binary data and append the Parquet file's > internal dictionary indices directly into an arrow::DictionaryBuilder, > yielding a DictionaryArray at the end. The end result uses less than > 10% as much memory (about 40MB compared with 500MB) and is almost 20x > faster to decode. > > The PR making this available finally in Python is here: > https://github.com/apache/arrow/pull/4999 > > Complex, multi-layered projects like this can be a little bit > inscrutable when discussed strictly at a code/technical level, but I > hope this helps show that employing dictionary encoding can have a lot > of user impact both in memory use and performance. > > - Wes -- Thanks & Best Regards
