[jira] [Updated] (HDDS-7297) Content sharing across objects.

Ritesh H Shukla (Jira) Fri, 07 Oct 2022 19:22:06 -0700


     [ https://issues.apache.org/jira/browse/HDDS-7297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ritesh H Shukla updated HDDS-7297:
----------------------------------
    Description: 
This request/suggestion was brought up by [~omalley] during 
[ApacheCon 2022|https://www.apachecon.com/acna2022/].

When mutating or creating a large table, applications could get a large 
performance boost by bringing in data from other existing objects or from 
older versions of the same object. Effectively, the same copy of the data 
could then be transparently addressed from multiple objects, or reused when an 
object is updated.

This capability can take many forms from an implementation standpoint, but we 
must design the API surface for applications first.

To make progress, we need to:
 # Identify the API surface that needs to be exposed for applications such as 
Iceberg or ORC writers to leverage this feature. Should this be done by 
exposing the underlying blocks, or by abstracting the blocks away and 
addressing content only as ranges in a file that are sourced from other files 
(and their corresponding ranges, similar to a scatter-gather list)?
 ## Should this be an extension of the vectored IO APIs?
 ## Is there a need to expose the layout of sharable content?
 # Model the API in the backend and work out how Ozone will make it work. This 
needs to be reasoned about across both EC and Replication.
 # Decide how this would be made available as an extension to the S3 APIs, in 
addition to OFS.
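
As a rough illustration of the range-based option in item 1, the sketch below 
models a new object as an ordered list of (source key, offset, length) 
entries, much like a scatter-gather list. Everything in it is hypothetical: 
SourceRange, ToyStore, and compose are names invented for this example and are 
not Ozone, OFS, or S3 APIs, and the in-memory maps only stand in for real block 
storage. It is meant to show the semantics an application might rely on, not a 
proposed implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: an object described as ranges borrowed from other objects. */
public class SharedRangeSketch {

    /** One scatter-gather entry: `length` bytes starting at `offset` in `sourceKey`. */
    static final class SourceRange {
        final String sourceKey;
        final int offset;
        final int length;

        SourceRange(String sourceKey, int offset, int length) {
            if (offset < 0 || length < 0) {
                throw new IllegalArgumentException("offset and length must be non-negative");
            }
            this.sourceKey = sourceKey;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Toy in-memory store standing in for the object store; not an Ozone API. */
    static final class ToyStore {
        private final Map<String, byte[]> plain = new HashMap<>();
        private final Map<String, List<SourceRange>> composed = new HashMap<>();

        /** Keys are write-once here, so a composed object never sees a source change. */
        private void requireNew(String key) {
            if (plain.containsKey(key) || composed.containsKey(key)) {
                throw new IllegalArgumentException("key already exists: " + key);
            }
        }

        /** Stores an ordinary object with its own bytes. */
        void put(String key, byte[] data) {
            requireNew(key);
            plain.put(key, data.clone());
        }

        /** Records `newKey` as a list of references only; no payload bytes are copied. */
        void compose(String newKey, List<SourceRange> ranges) {
            requireNew(newKey);
            for (SourceRange r : ranges) {
                int sourceLength = read(r.sourceKey).length; // throws if the source is missing
                if ((long) r.offset + r.length > sourceLength) {
                    throw new IllegalArgumentException("range exceeds source: " + r.sourceKey);
                }
            }
            composed.put(newKey, List.copyOf(ranges));
        }

        /** Reads an object; composed objects are gathered from their source ranges. */
        byte[] read(String key) {
            byte[] data = plain.get(key);
            if (data != null) {
                return data.clone();
            }
            List<SourceRange> ranges = composed.get(key);
            if (ranges == null) {
                throw new IllegalArgumentException("no such key: " + key);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (SourceRange r : ranges) {
                out.write(read(r.sourceKey), r.offset, r.length);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.put("table-v1", "HEADER|rowsA|".getBytes(StandardCharsets.UTF_8));
        store.put("delta", "rowsB|".getBytes(StandardCharsets.UTF_8));

        // The new version reuses the header and old rows, and splices in the delta.
        store.compose("table-v2", List.of(
                new SourceRange("table-v1", 0, 7),
                new SourceRange("delta", 0, 6),
                new SourceRange("table-v1", 7, 6)));

        System.out.println(new String(store.read("table-v2"), StandardCharsets.UTF_8));
    }
}
```

Running main prints HEADER|rowsB|rowsA|. The toy store rejects overwrites so 
that a composed object can never observe a changed source; a real design would 
still have to settle how versioning, EC, and replication interact with shared 
ranges.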

HDDS-7288 (https://issues.apache.org/jira/browse/HDDS-7288) is a duplicate of 
this one. Filing this issue to capture the full context of the discussion. 

  was:
This request/suggestion was brought up by [~omalley] during 
[[https://www.apachecon.com/acna2022/]|Apache Con 2022].  [link 
title|http://example.com/]

When mutating/creating a large table, there could be a huge performance boost 
achieved if applications can bring in data from either other existing objects 
or older versions of the same object. Thus, effectively the same copy of the 
data can be transparently addressed from multiple objects or when an object is 
updated.

This capability can take many forms from an implementation standpoint, but we 
need to design the API surface for applications first.

To make progress, we need to do
 # Identify the API surface that needs to be exposed for applications such as 
iceberg or ORC writers to leverage this feature. Should be done via exposing 
underlying blocks or abstracting the blocks away and only addressing this as 
ranges in a file to be sourced from other files (and their corresponding 
ranges, similar to a scatter-gather list).
 ## Look into if this needs to be an extension of vectoredIO APIs.
 ##  Is there a need to expose the layout of sharable content  
 # Backend modeling of the API and how Ozone will make it work. This needs to 
be reasoned across EC and Replication.
 # How would this be made available as an extension to S3 APIs in addition to 
OFS.

The https://issues.apache.org/jira/browse/HDDS-7288 is a duplicate of this one. 
Filling this to capture the full context of the discussion. 


> Content sharing across objects.
> -------------------------------
>
>                 Key: HDDS-7297
>                 URL: https://issues.apache.org/jira/browse/HDDS-7297
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Datanode, Ozone Filesystem, Ozone Manager
>            Reporter: Ritesh H Shukla
>            Assignee: Ritesh H Shukla
>            Priority: Major
>
> This request/suggestion was brought up by [~omalley] during 
> [ApacheCon 2022|https://www.apachecon.com/acna2022/].
> When mutating or creating a large table, applications could get a large 
> performance boost by bringing in data from other existing objects or from 
> older versions of the same object. Effectively, the same copy of the data 
> could then be transparently addressed from multiple objects, or reused when 
> an object is updated.
> This capability can take many forms from an implementation standpoint, but we 
> must design the API surface for applications first.
> To make progress, we need to:
>  # Identify the API surface that needs to be exposed for applications such as 
> Iceberg or ORC writers to leverage this feature. Should this be done by 
> exposing the underlying blocks, or by abstracting the blocks away and 
> addressing content only as ranges in a file that are sourced from other files 
> (and their corresponding ranges, similar to a scatter-gather list)?
>  ## Should this be an extension of the vectored IO APIs?
>  ## Is there a need to expose the layout of sharable content?
>  # Model the API in the backend and work out how Ozone will make it work. This 
> needs to be reasoned about across both EC and Replication.
>  # Decide how this would be made available as an extension to the S3 APIs, in 
> addition to OFS.
> HDDS-7288 (https://issues.apache.org/jira/browse/HDDS-7288) is a duplicate of 
> this one. Filing this issue to capture the full context of the discussion. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

