emkornfield commented on issue #39581:
URL: https://github.com/apache/arrow/issues/39581#issuecomment-1890860392

   > def/rep level encoding is part of parquet standard, which is not a part of 
arrow 🤔 Maybe we can trying to making it faster...
   
   Agreed. While it's not explicitly stated, even [though the thrift definition allows for specifying an encoding](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L585), I don't think plain encoding was ever intended here (one would have to take a closer look at parquet-mr to confirm). And since there isn't a native primitive type for [int16](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32), we'd have to use int32 as the primitive type, which would probably compress well but would still require a cast back to int16.
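   To illustrate the cast cost: if levels were materialized as int32 (the narrowest Parquet primitive), getting them into the int16 form Arrow's level machinery uses would still need an extra pass over the data. A minimal sketch (the function name is illustrative, not an actual Arrow API):

   ```cpp
   #include <cstdint>
   #include <vector>

   // Hypothetical helper: narrow int32-decoded levels to int16.
   // Even if the int32 column compressed well, this second pass
   // (and the extra int32 buffer) is pure overhead.
   std::vector<int16_t> CastLevels(const std::vector<int32_t>& raw) {
     std::vector<int16_t> out(raw.size());
     for (std::size_t i = 0; i < raw.size(); ++i) {
       out[i] = static_cast<int16_t>(raw[i]);
     }
     return out;
   }
   ```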
   
   > Probably bit pack can use 
https://github.com/powturbo/TurboPFor-Integer-Compression
   And then write levels with bit pack
   
   I think the [original code for bit packing](https://github.com/apache/arrow/blob/e6323646558ee01234ce58af273c5a834745f298/cpp/src/arrow/util/bpacking_default.h#L20) probably came from that source.
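   For readers less familiar with the technique: bit packing stores each level in exactly `bit_width` bits, LSB-first, so decoding is a shift-and-mask over a sliding bit window. A scalar sketch of the idea (the vectorized code in bpacking_default.h specializes per width; this naive loop is only for illustration and assumes the input buffer is padded by at least 8 bytes):

   ```cpp
   #include <cstddef>
   #include <cstdint>

   // Scalar sketch: read num_values integers of bit_width bits each
   // from a packed little-endian buffer. Caller must pad `in` by at
   // least 8 bytes past the last packed bit.
   void UnpackScalar(const uint8_t* in, int bit_width,
                     uint32_t* out, std::size_t num_values) {
     const uint32_t mask = (bit_width == 32) ? 0xFFFFFFFFu
                                             : ((1u << bit_width) - 1);
     std::size_t bit_pos = 0;
     for (std::size_t i = 0; i < num_values; ++i) {
       std::size_t byte = bit_pos >> 3;
       int shift = static_cast<int>(bit_pos & 7);
       // Gather 8 bytes so any (shift + bit_width) <= 39 bits fit.
       uint64_t window = 0;
       for (int b = 0; b < 8; ++b) {
         window |= static_cast<uint64_t>(in[byte + b]) << (8 * b);
       }
       out[i] = static_cast<uint32_t>(window >> shift) & mask;
       bit_pos += bit_width;
     }
   }
   ```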
   
   > Also, after re-checking the RleDecoder, I think unpack16 might improve the 
performance but it's not the bottleneck in your case, here maybe the 
GetBatch<int16_t> loop be slow...
   
   I think @mapleFU's suggestion of a specialized unpack16 makes sense. At the very least it avoids an extra round trip to/from the cache if we can write directly to the output values instead of going through a temporary value holder.
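   A hedged sketch of what such an unpack16 could look like: decode straight into the int16_t destination rather than unpacking into a temporary uint32 buffer and casting afterwards. The name and shape are assumptions for illustration, not the actual Arrow API, and it makes the same 8-byte-padding assumption as the generic scalar loop:

   ```cpp
   #include <cstddef>
   #include <cstdint>

   // Hypothetical specialized decoder: write bit-packed levels
   // directly into int16_t output (one store per value, no temp
   // uint32 buffer + cast pass). Levels need bit_width <= 15 here.
   // Caller must pad `in` by at least 8 bytes.
   void Unpack16Direct(const uint8_t* in, int bit_width,
                       int16_t* out, std::size_t num_values) {
     const uint32_t mask = (1u << bit_width) - 1;
     std::size_t bit_pos = 0;
     for (std::size_t i = 0; i < num_values; ++i) {
       std::size_t byte = bit_pos >> 3;
       int shift = static_cast<int>(bit_pos & 7);
       uint64_t window = 0;
       for (int b = 0; b < 8; ++b) {
         window |= static_cast<uint64_t>(in[byte + b]) << (8 * b);
       }
       out[i] = static_cast<int16_t>((window >> shift) & mask);
       bit_pos += bit_width;
     }
   }
   ```

   The point of the design is that the int16 store replaces both the uint32 store and the later narrowing loop, so the decoded levels only pass through the cache once.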
   
   