emkornfield commented on issue #39581: URL: https://github.com/apache/arrow/issues/39581#issuecomment-1890860392
> def/rep level encoding is part of parquet standard, which is not a part of arrow 🤔 Maybe we can trying to making it faster...

Agreed. While it's not explicitly stated, even though [the thrift definition allows specifying an encoding](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L585), I don't think plain encoding was ever intended here (I'd have to take a closer look at parquet-mr). And since there isn't a native primitive type for [int16](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32), we'd have to use int32 as the primitive type, which I guess would compress well but would still require a cast.

> Probably bit pack can use https://github.com/powturbo/TurboPFor-Integer-Compression And then write levels with bit pack

I think the [original code for bit packing](https://github.com/apache/arrow/blob/e6323646558ee01234ce58af273c5a834745f298/cpp/src/arrow/util/bpacking_default.h#L20) probably came from that source.

> Also, after re-checking the RleDecoder, I think unpack16 might improve the performance but it's not the bottleneck in your case, here maybe the GetBatch<int16_t> loop be slow...

I think @mapleFU's suggestion of a specialized unpack16 makes sense. At the very least it avoids an extra round trip to/from the cache, since we could write directly into the output values instead of going through a temporary value holder.
