Compression in ORC not only crosses rows, but also crosses the row groups (every 10,000 rows) that are the index points. See the Compression section of the ORC specification (https://orc.apache.org/specification/ORCv1/). Compression does not cross stripe boundaries, because that would violate the constraint that each stripe can be read independently. The expected case is to stream through all of the rows in a stripe, so the format is optimized for compression over that access pattern.
Note that the constraints also don't run in the other direction: a single value may cross several compression chunks. All of the kinds of streams use the generic compression (zlib, zstd, snappy, etc.) the same way, and the generic compression is the last stage of the process.

You should probably look at how seek is done using the indexes. To seek to the start of a row group, we keep a list of integers for each stream. For a compressed integer stream, the index will have three values:

- <compressed byte offset from start of stream of a compression chunk>
- <uncompressed byte offset within the compression chunk of the rle block>
- <rle offset within the rle block>

So to jump to row 10000, you'd use the first number to find the number of compressed bytes to jump over, and you'd decompress starting from there. From the decompressed bytes, you'd skip over the second value's number of bytes and start the rle decompression. Then you'd use the third number to skip over that many values from the rle.

.. Owen

On Tue, Jul 26, 2022 at 6:10 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> In variable-length types like strings, there is the stream kind "Data"
> containing the concatenated values. When decoding to e.g. a vector of
> strings, is there any constraint on whether compression breaks the value
> boundaries?
>
> I.e., say we have a string column with 2 rows of 100Mb each, [r1, r2]
> (which are concatenated in "Data"). Can we end up with a compression where
> r2 is split between two compression chunks?
>
> Is this also valid for the stream kind "Length"?
>
> More broadly, the question is whether, when deserializing, we need to
> concatenate bytes from parts of the compressed items, or whether we can
> assume that compression respects row boundaries.
>
> Best,
> Jorge
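To make the three-value seek described above concrete, here is a minimal Python sketch. It is not the real ORC reader; the chunk size, the fixed-width int32 "rle block" encoding, and all helper names are assumptions made for illustration. The stream is a sequence of independently zlib-compressed chunks, and each index entry gives (compressed offset of the chunk, uncompressed byte offset within the chunk, value offset within the block):

```python
# Toy model of ORC-style row-group seek (hypothetical, not the ORC API).
# Each "rle block" here is a single little-endian int32, so the third
# index value is always 0 in this simplified encoding; real ORC uses
# actual RLE runs, where the third value skips values within the run.
import struct
import zlib

VALUES_PER_CHUNK = 1000  # assumed chunk size for the demo

# --- writer side: build compressed chunks, record chunk offsets -------
values = list(range(10_000))
chunks = []
chunk_offsets = [0]  # compressed byte offset of each chunk in the stream
for i in range(0, len(values), VALUES_PER_CHUNK):
    raw = b"".join(struct.pack("<i", v) for v in values[i:i + VALUES_PER_CHUNK])
    comp = zlib.compress(raw)
    chunks.append(comp)
    chunk_offsets.append(chunk_offsets[-1] + len(comp))
stream = b"".join(chunks)

def index_entry(row):
    """The three seek values for a row: compressed offset of its chunk,
    uncompressed byte offset within the chunk, value offset in the block."""
    chunk_no, within = divmod(row, VALUES_PER_CHUNK)
    return chunk_offsets[chunk_no], within * 4, 0

# --- reader side: seek using the three values -------------------------
def seek_and_read(stream, row):
    comp_off, uncomp_off, blk_skip = index_entry(row)
    # 1) jump over comp_off compressed bytes, decompress one chunk from there
    decomp = zlib.decompressobj().decompress(stream[comp_off:])
    # 2) skip uncomp_off decompressed bytes to the start of the block,
    # 3) then skip blk_skip values within it (always 0 in this toy encoding)
    pos = uncomp_off + blk_skip * 4
    return struct.unpack("<i", decomp[pos:pos + 4])[0]

print(seek_and_read(stream, 7_321))  # -> 7321
```

Because each chunk is compressed independently, the reader can decompress starting at any chunk boundary without touching earlier bytes, which is exactly what makes the first index value usable as a jump target.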