[ 
https://issues.apache.org/jira/browse/HDDS-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625596#comment-17625596
 ] 

Stephen O'Donnell edited comment on HDDS-7350 at 10/28/22 9:46 PM:
-------------------------------------------------------------------

The attached design document is little more than an overview of the proposed 
feature. We need to think through some parts of it in more detail. For example:

With a compressed file - can we seek to an offset and start reading there?

Ozone currently writes data in 4MB "chunks". Do we open a new compression 
stream for each chunk, or just one compression stream for the entire file?

Should a chunk then hold 4MB of compressed data, which might represent much 
more than 4MB of uncompressed data? Or do we keep the chunks at 4MB of 
uncompressed data, so that they are smaller when they are written to the 
datanode? That way, we know chunk 1 covers offsets 0 -> 4MB, chunk 2 covers 
4MB -> 8MB, etc.

Perhaps the chunk metadata could contain the uncompressed offsets in the file 
and the uncompressed size. That would allow seeking to a chunk boundary and 
starting to read a new compression stream from there.
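
To make the idea concrete, here is a minimal sketch in Python, assuming one 
independent zlib stream per 4MB uncompressed chunk. The ChunkInfo fields and 
the write_chunks / read_at helpers are hypothetical names for illustration, not 
Ozone's actual classes or API.

```python
import zlib
from dataclasses import dataclass

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB of uncompressed data per chunk


@dataclass
class ChunkInfo:
    """Hypothetical per-chunk metadata; not Ozone's real ChunkInfo."""
    uncompressed_offset: int  # where the chunk starts in the uncompressed file
    uncompressed_length: int  # number of uncompressed bytes in the chunk
    stored_offset: int        # where the compressed bytes start in storage
    stored_length: int        # number of compressed bytes written


def write_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Compress each fixed-size uncompressed chunk as its own zlib stream."""
    stored = bytearray()
    index = []
    for offset in range(0, len(data), chunk_size):
        raw = data[offset:offset + chunk_size]
        packed = zlib.compress(raw)
        index.append(ChunkInfo(offset, len(raw), len(stored), len(packed)))
        stored += packed
    return bytes(stored), index


def read_at(stored: bytes, index, offset: int, length: int,
            chunk_size: int = CHUNK_SIZE) -> bytes:
    """Read `length` uncompressed bytes starting at uncompressed `offset`."""
    out = bytearray()
    i = offset // chunk_size  # chunk boundaries are known without scanning
    while length > 0 and i < len(index):
        info = index[i]
        end = info.stored_offset + info.stored_length
        raw = zlib.decompress(stored[info.stored_offset:end])
        start = offset - info.uncompressed_offset
        piece = raw[start:start + length]
        out += piece
        offset += len(piece)
        length -= len(piece)
        i += 1
    return bytes(out)
```

Because every chunk covers a fixed uncompressed range, a seek only has to 
decompress from the start of the chunk that contains the target offset, never 
from the start of the file.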

EC perhaps isn't too different. We would just EC encode the compressed chunks, 
although a variable chunk size might cause problems for EC. Whatever we do 
here, we need to be sure EC can fit into the same framework, as users will 
surely want transparent compression on EC data too.
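
As a toy illustration of the variable-size concern, the Python sketch below 
uses a single XOR parity as a stand-in for a real erasure code. It is not how 
Ozone's EC is implemented; pad_cells and xor_parity are hypothetical helpers, 
and zero-padding to a common width is only one assumed way to cope.

```python
def xor_parity(cells):
    """XOR parity over equally sized cells (toy stand-in for a real code)."""
    sizes = {len(c) for c in cells}
    if len(sizes) != 1:
        raise ValueError(f"cells must be the same length, got {sorted(sizes)}")
    parity = bytearray(len(cells[0]))
    for cell in cells:
        for i, b in enumerate(cell):
            parity[i] ^= b
    return bytes(parity)


def pad_cells(cells):
    """Zero-pad cells to a common width, remembering the original lengths."""
    width = max(len(c) for c in cells)
    padded = [c.ljust(width, b"\x00") for c in cells]
    return padded, [len(c) for c in cells]
```

Compressed chunks of different lengths cannot be fed to xor_parity directly. 
After padding, a lost cell can be rebuilt by XOR-ing the parity with the 
surviving cells and truncating to the recorded length, which means the 
original lengths would have to be kept somewhere in the metadata.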

In EC, we implemented a kind of hierarchy to set the replication type of a key. 
There is a server default, a bucket level setting and a key level setting. That 
means if nothing is specified, the server default is used. If there is a bucket 
setting, keys inherit it, but can override it if they like. If there is no 
bucket setting, the key level setting still applies. For consistency we should 
aim to do the same thing here.
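
The lookup order described above could be sketched like this in Python. The 
function name, the "none" default and the codec strings are assumptions for 
illustration only; they are not existing Ozone configuration keys or values.

```python
from typing import Optional

SERVER_DEFAULT = "none"  # hypothetical server-wide default: no compression


def resolve_compression(key_setting: Optional[str],
                        bucket_setting: Optional[str],
                        server_default: str = SERVER_DEFAULT) -> str:
    """Pick the most specific setting: key, then bucket, then server default."""
    if key_setting is not None:
        return key_setting      # an explicit key setting always wins
    if bucket_setting is not None:
        return bucket_setting   # otherwise the key inherits from its bucket
    return server_default       # nothing specified anywhere
```

For example, resolve_compression(None, "zstd") yields the bucket's value, 
while resolve_compression("snappy", "zstd") lets the key override it.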


> Ozone Transparent Data Compression Support
> ------------------------------------------
>
>                 Key: HDDS-7350
>                 URL: https://issues.apache.org/jira/browse/HDDS-7350
>             Project: Apache Ozone
>          Issue Type: New Feature
>            Reporter: Kirill Sizov
>            Assignee: Kirill Sizov
>            Priority: Major
>         Attachments: compression_ozone - 2022.10.1.pdf, 
> compression_ozone-2022.10.2.pdf
>
>
> Currently Ozone stores uncompressed data, which in the case of text or a 
> similar format may benefit from being compressed. This may save a significant 
> amount of space and hence money.
> See the attached document for the design.


