curioustien commented on PR #44043:
URL: https://github.com/apache/arrow/pull/44043#issuecomment-2586007265

   > Perhaps a quick check that other Parquet readers only use the row group 
ordinal for encrypted files?
   
   @mapleFU @pitrou I'm trying to pick up this task to ramp up on the parquet 
codebase. Upon checking the parquet java implementation of this `ordinal` field 
in `RowGroup`, I think there is a small bug/discrepancy in the current cpp 
implementation vs the java implementation.
   
   For the java implementation, there is [this check whether it's an encrypted 
file or 
not](https://github.com/apache/parquet-java/blob/apache-parquet-1.15.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L606-L653)
 by checking the `fileEncryptor` pointer. If it's not an encrypted file, the 
`rowGroupOrdinal` value is `-1` which I believe it means that Thrift won't set 
that field. Otherwise, we use the `rowGroupOrdinal` value which is [incremented 
every time we flush the row 
group](https://github.com/apache/parquet-java/blob/apache-parquet-1.15.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L207).
 Though, there is a small bug here in the java implementation that we use `int` 
instead of `short` for `int16 ordinal` variable. Plus, there are no checks here 
on integer overflow. In practice, people probably don't hit this limit on the 
number of row groups anyways, so this bug hasn't showed up yet.
   
   For the cpp implementation, we don't really have a check for the encrypted 
file like the java implementation when we write row group ordinal (not that 
I've found, or maybe I'm missing something), so we always write this ordinal 
value even for unencrypted files. In my opinion, we should change the logic 
here to behave like the java implementation with an additional check on integer 
overflow.
   
   What do you both think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to