zhangjun0x01 commented on issue #1666:
URL: https://github.com/apache/iceberg/issues/1666#issuecomment-718301579


   Hi @rdblue, @shardulm94:
   I read the source code and found that when constructing the DataFile in BaseTaskWriter.RollingFileWriter#closeCurrent, we get fileSizeInBytes from the length() method of the currentAppender. OrcFileAppender implements length() via the getRawDataSize() method of the ORC Writer. According to the javadoc of that method, it returns the deserialized data size:
   
   ```
     /**
      * Return the deserialized data size. Raw data size will be compute when
      * writing the file footer. Hence raw data size value will be available only
      * after closing the writer.
      *
      * @return raw data size
      */
     long getRawDataSize();
   ```
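   
   To make the difference concrete, here is a minimal standalone sketch (not Iceberg code; the path and schema are made up for illustration) that writes a small ORC file and compares the writer's getRawDataSize() with the actual on-disk length reported by the file system:
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
   import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
   import org.apache.orc.OrcFile;
   import org.apache.orc.TypeDescription;
   import org.apache.orc.Writer;
   
   public class OrcSizeCheck {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       Path path = new Path("/tmp/orc-size-check.orc"); // hypothetical path
       TypeDescription schema = TypeDescription.fromString("struct<x:bigint>");
   
       Writer writer = OrcFile.createWriter(path,
           OrcFile.writerOptions(conf).setSchema(schema));
       VectorizedRowBatch batch = schema.createRowBatch();
       LongColumnVector x = (LongColumnVector) batch.cols[0];
       for (long i = 0; i < 10_000; i++) {
         x.vector[batch.size++] = i;
         if (batch.size == batch.getMaxSize()) {
           writer.addRowBatch(batch);
           batch.reset();
         }
       }
       if (batch.size > 0) {
         writer.addRowBatch(batch);
       }
       // per the javadoc above, the raw data size is only final after close()
       writer.close();
   
       long rawDataSize = writer.getRawDataSize();  // deserialized data size
       long fileSize = path.getFileSystem(conf)
           .getFileStatus(path).getLen();            // actual bytes on disk
       System.out.printf("rawDataSize=%d, fileSize=%d%n", rawDataSize, fileSize);
     }
   }
   ```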
   
   Parquet, by contrast, gets the length from the output stream position. I am not sure which behavior is correct for ORC and Parquet, but the length reported for the Parquet format matches my expectations: when I inspect the HDFS file with the hdfs fsck command, I see that it splits blocks according to the on-disk file size, not the deserialized data size.
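   
   For comparison, the position-based approach attributed to Parquet above could look roughly like the following sketch (a hypothetical counting stream, not Iceberg's actual implementation): the writer wraps its output stream and counts bytes as they are written, so the reported length reflects bytes on disk.
   
   ```java
   import java.io.FilterOutputStream;
   import java.io.IOException;
   import java.io.OutputStream;
   
   // Hypothetical counting stream for illustration: tracks how many bytes
   // have actually been written, i.e. the current file position.
   class CountingOutputStream extends FilterOutputStream {
     private long pos = 0;
   
     CountingOutputStream(OutputStream out) {
       super(out);
     }
   
     @Override
     public void write(int b) throws IOException {
       out.write(b);
       pos++;
     }
   
     @Override
     public void write(byte[] b, int off, int len) throws IOException {
       out.write(b, off, len);
       pos += len;
     }
   
     long getPos() {
       return pos; // bytes written so far == on-disk length once flushed
     }
   }
   ```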
   

