Compression in ORC crosses not only rows but also the row groups (every
10,000 rows) that serve as the index points. See the Compression section of
the ORC specification (https://orc.apache.org/specification/ORCv1/).
Compression does not cross stripe boundaries, because that would violate
the constraint that each stripe can be read independently. The expected
case is to stream through all of the rows in a stripe, so the format is
optimized to improve compression.

Note that the constraint doesn't run in the other direction either: a
single value may span several compression chunks.
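To make that concrete, here is a rough Python sketch of how a reader walks
a compressed stream. Per the ORCv1 spec, each compression chunk starts with
a 3-byte little-endian header of (length << 1) | isOriginal, and the zlib
codec is raw DEFLATE. The function name is just illustrative:

```python
import zlib

def decompress_orc_stream(data: bytes) -> bytes:
    """Decompress an ORC stream made of compression chunks (zlib codec).

    Each chunk starts with a 3-byte little-endian header:
    (length << 1) | is_original. If is_original is set, the chunk body
    was stored as-is because it did not shrink under compression.
    """
    out = bytearray()
    pos = 0
    while pos < len(data):
        header = int.from_bytes(data[pos:pos + 3], "little")
        is_original = header & 1
        length = header >> 1
        body = data[pos + 3:pos + 3 + length]
        # Chunks are independent; the reader simply concatenates their
        # decompressed output, which is why one value may span chunks.
        out += body if is_original else zlib.decompress(body, wbits=-15)
        pos += 3 + length
    return bytes(out)
```

Because the reader just concatenates the decompressed chunk bodies, nothing
about a value's boundaries needs to line up with the chunk boundaries.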

All of the stream kinds use the generic compression codec (zlib, zstd,
snappy, etc.) the same way. The generic compression is the last stage of
the process.
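In other words, the write path looks roughly like the sketch below. The
toy_rle_encode here is a deliberately simplified stand-in (ORC's real
RLEv1/RLEv2 formats are more involved); the point is only the ordering of
the stages:

```python
import zlib

def toy_rle_encode(values):
    """Toy RLE: emit (value, count) byte pairs. Assumes values and run
    lengths fit in one byte; a stand-in for ORC's real RLE codecs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return b"".join(bytes([v, n]) for v, n in runs)

def encode_stream(values):
    # Stage 1: type-specific lightweight encoding (RLE here).
    encoded = toy_rle_encode(values)
    # Stage 2: generic compression (zlib/zstd/snappy) is applied last,
    # over the already-encoded bytes, in fixed-size chunks.
    return zlib.compress(encoded)
```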

You should probably look at how seek is done using the indexes. To seek to
the start of a row group, we keep a list of integers for each stream. For a
compressed integer stream, the index entry will have three values:

   - <compressed byte offset, from the start of the stream, of a
     compression chunk>
   - <uncompressed byte offset, within that compression chunk, of an rle
     block>
   - <value offset within that rle block>

So to jump to row 10,000, you'd use the first number to find how many
compressed bytes to jump over, and you'd start decompressing from there.
From the decompressed bytes, you'd skip over the number of bytes given by
the second value and start the rle decompression. Finally, you'd use the
third number to skip that many values in the rle output.
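The three steps above can be sketched like this. The function and the toy
RLE decoder are illustrative (they are not the ORC reader API, and the
real reader uses ORC's RLEv1/v2 and continues across chunk boundaries as
needed); the sketch assumes the zlib codec, which ORC stores as raw
DEFLATE:

```python
import zlib

def toy_rle_decode(data: bytes):
    """Toy RLE decoder: (value, count) byte pairs; a stand-in for
    ORC's real RLE formats."""
    out = []
    for i in range(0, len(data), 2):
        out.extend([data[i]] * data[i + 1])
    return out

def seek_to_row_group(stream: bytes, index_entry):
    """Seek into a compressed integer stream using the three index
    values: (chunk offset, decompressed offset, rle value skip)."""
    chunk_offset, decompressed_offset, rle_skip = index_entry

    # 1. Jump over `chunk_offset` compressed bytes to land on a
    #    compression-chunk boundary, then decompress that chunk.
    pos = chunk_offset
    header = int.from_bytes(stream[pos:pos + 3], "little")
    is_original = header & 1
    length = header >> 1
    body = stream[pos + 3:pos + 3 + length]
    raw = body if is_original else zlib.decompress(body, wbits=-15)

    # 2. Skip `decompressed_offset` bytes into the decompressed data
    #    to reach the start of an rle block.
    rle_block = raw[decompressed_offset:]

    # 3. Decode the rle block and discard the first `rle_skip` values.
    return toy_rle_decode(rle_block)[rle_skip:]
```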

.. Owen




On Tue, Jul 26, 2022 at 6:10 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> In variable-length types like strings, there is the stream kind "Data"
> containing the concatenated values. When decoding to e.g. a vector of
> strings, is there any constraint over whether compression breaks the values
> boundaries?
>
> I.e. say we have a string column with 2 rows of 100 MB each, [r1, r2]
> (which are concatenated in "Data"). Can we end up with a compression where
> r2 is split between two compression chunks?
>
> Is this also valid for the stream kind "Length"?
>
> More broadly, the question is whether, when deserializing we need to
> "concatenate" bytes from parts of the compressed items or whether we can
> assume that compression respects row boundaries.
>
> Best,
> Jorge
>
