[jira] [Updated] (HDDS-7297) Content sharing across objects.

Ritesh H Shukla (Jira) Fri, 07 Oct 2022 19:20:05 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ritesh H Shukla updated HDDS-7297:
----------------------------------
    Description: 
This request/suggestion was brought up by [~omalley] during 
[ApacheCon 2022|https://www.apachecon.com/acna2022/].

When mutating or creating a large table, applications could see a large 
performance boost if they can bring in data from either other existing objects 
or older versions of the same object. Effectively, the same copy of the data 
can then be transparently addressed from multiple objects, or reused when an 
object is updated.

This can take many forms from an implementation standpoint but we need to 
design the API surface for applications first.

To make progress, we need to:
 # Identify the API surface that needs to be exposed for applications such as 
Iceberg or ORC writers to leverage this feature. Should this be done by exposing 
the underlying blocks, or by abstracting the blocks away and addressing shared 
content only as ranges in a file to be sourced from other files (and their 
corresponding ranges, similar to a scatter-gather list)?
 ## Look into whether this needs to be an extension of the vectored IO APIs.
 ## Is there a need to expose the layout of sharable content?
 # Model the API in the backend and work out how Ozone will make it work. This 
needs to be reasoned about for both EC and replication.
 # Determine how this would be made available as an extension to the S3 APIs, 
in addition to OFS.
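
To make the range-based option in item 1 concrete, here is a minimal sketch 
of what a scatter-gather style description of a new object could look like. 
It is only an illustration under stated assumptions: the names 
SharedRangeSketch, SourceRange, ComposedObject and resolve are hypothetical 
and do not exist in Ozone, Hadoop or S3, and the sketch says nothing about 
blocks, EC or replication.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only: none of these types exist in Ozone or Hadoop. */
public class SharedRangeSketch {

    /** One scatter-gather entry: bytes [sourceOffset, sourceOffset + length) of sourceKey. */
    static final class SourceRange {
        final String sourceKey;   // existing object (or older version) that holds the data
        final long sourceOffset;  // byte offset inside the source object
        final long length;        // number of bytes shared from the source

        SourceRange(String sourceKey, long sourceOffset, long length) {
            if (sourceOffset < 0 || length <= 0) {
                throw new IllegalArgumentException("offset must be >= 0 and length > 0");
            }
            this.sourceKey = sourceKey;
            this.sourceOffset = sourceOffset;
            this.length = length;
        }
    }

    /** Ordered list of ranges that together make up the content of a new object. */
    static final class ComposedObject {
        private final List<SourceRange> ranges = new ArrayList<>();

        ComposedObject append(SourceRange range) {
            ranges.add(range);
            return this;
        }

        /** Length of the new object: the sum of all shared range lengths. */
        long totalLength() {
            long total = 0;
            for (SourceRange r : ranges) {
                total += r.length;
            }
            return total;
        }

        /** Map an offset in the new object to "sourceKey@sourceOffset" of the backing data. */
        String resolve(long targetOffset) {
            if (targetOffset < 0) {
                throw new IndexOutOfBoundsException("negative offset " + targetOffset);
            }
            long start = 0; // offset in the new object where the current range begins
            for (SourceRange r : ranges) {
                if (targetOffset < start + r.length) {
                    return r.sourceKey + "@" + (r.sourceOffset + (targetOffset - start));
                }
                start += r.length;
            }
            throw new IndexOutOfBoundsException(
                "offset " + targetOffset + " is beyond object length " + start);
        }
    }

    public static void main(String[] args) {
        // Build a new object from two ranges of (hypothetical) existing objects.
        ComposedObject table = new ComposedObject()
            .append(new SourceRange("table/v1/part-0", 0, 1024))
            .append(new SourceRange("table/v1/part-3", 4096, 512));
        System.out.println(table.totalLength());
        System.out.println(table.resolve(1030));
    }
}
```

In this sketch no data is copied: the new object is described purely by 
references to ranges of existing objects, which is the property the request 
is after. Whether such ranges should be visible to applications, or hidden 
behind blocks, is exactly the open question in item 1.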

HDDS-7288 (https://issues.apache.org/jira/browse/HDDS-7288) is a duplicate of 
this one. Filing this to capture the full context of the discussion. 

  was:
This request/suggestion was brought up by [~omalley] during [Apache Con 
2022|[https://www.apachecon.com/acna2022/]]. 

When mutating a large table, there would be a huge performance boost achieved 
if applications can address data from either other objects stored 
previously or other versions of the same object. These objects could be older 
snapshots or other versions of the same object (maintained by iceberg or via 
snapshots or object versions in Ozone).

To make progress we need to do
 # Identify the API surface that needs to be exposed for applications such as 
iceberg or ocr writers to leverage this feature. Should be done via exposing 
underlying blocks or abstracting the blocks away and only addressing this as 
ranges in a file to be sourced from other files (and their corresponding 
ranges, similar to a scatter gather list).
 ## Look into if this needs to be an extension of vectorIO APIs.
 ##  Is there a need to expose the layout of sharable content  
 # Backend modeling of the API and how Ozone will make it work. This needs to 
be reasoned across EC and Replication.
 # How would this be made available as an extension to S3 APIs in addition to 
OFS.

The https://issues.apache.org/jira/browse/HDDS-7288 is a duplicate of this one. 
Filling this to capture the full context of the discussion. 


> Content sharing across objects.
> -------------------------------
>
>                 Key: HDDS-7297
>                 URL: https://issues.apache.org/jira/browse/HDDS-7297
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Datanode, Ozone Filesystem, Ozone Manager
>            Reporter: Ritesh H Shukla
>            Assignee: Ritesh H Shukla
>            Priority: Major


