Hi Paimon devs,

Following the previous discussion, I have written proof-of-concept code [1]. 
The main changes are as follows:
1. Provide `data-file.external-path` to indicate the location of the newly 
written data. If this item is empty, the data is still written to the path 
specified by the warehouse as before.
2. Add the `dataRootLocation` attribute in DataFileMeta to indicate the 
location of the data. If `data-file.external-path` is not empty, this value is 
`data-file.external-path`, otherwise it is the warehouse path.
3. Provide TablePathProvider, which can build the storage path of the table 
according to `data-file.external-path` or warehouse path.
4. Provide HybridFileIO, which can create the corresponding FileIO according to 
the scheme.
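
To make items 1 to 3 concrete, here is a minimal sketch of the fallback rule: 
use `data-file.external-path` when it is set, otherwise the warehouse path. 
This is not the code in the PR. `DataRootResolver`, its method names, and the 
`<root>/<database>.db/<table>` layout are assumptions made only for 
illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; it is NOT the TablePathProvider from the PR.
final class DataRootResolver {
    static final String EXTERNAL_PATH_KEY = "data-file.external-path";

    private DataRootResolver() {}

    /** Returns the external path if it is set and non-blank, otherwise the warehouse path. */
    static String resolveDataRoot(Map<String, String> tableOptions, String warehousePath) {
        String external = tableOptions.get(EXTERNAL_PATH_KEY);
        if (external == null || external.trim().isEmpty()) {
            return warehousePath;
        }
        return external;
    }

    /** Builds a table path under the given root (directory layout assumed for this sketch). */
    static String tablePath(String root, String database, String table) {
        // Drop a single trailing slash so the joined path has no "//".
        String base = root.endsWith("/") ? root.substring(0, root.length() - 1) : root;
        return base + "/" + database + ".db/" + table;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        String warehouse = "hdfs://nn:8020/warehouse";

        // No external path configured: data stays under the warehouse.
        System.out.println(tablePath(resolveDataRoot(options, warehouse), "db1", "orders"));

        // External path configured: newly written data goes there instead.
        options.put(EXTERNAL_PATH_KEY, "s3://bucket/ext/");
        System.out.println(tablePath(resolveDataRoot(options, warehouse), "db1", "orders"));
    }
}
```

The resolved root is what item 2 would record as `dataRootLocation` for each 
newly written file, so a reader can locate the file even if the option changes 
later.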


The above is only part of the whole work, but the code already gives a sense of 
the complexity involved and the scope of the changes. Comments and corrections 
are welcome.


[1] https://github.com/apache/paimon/pull/4720/files


Best,
Houliang



---- Replied Message ----
| From | Houliang Qi<neuyi...@163.com> |
| Date | 12/13/2024 18:52 |
| To | dev@paimon.apache.org<dev@paimon.apache.org> |
| Subject | Re: [DISCUSS] Introduce Table Multi-Location Management |
Hi Jingsong,


Thanks for your reply.

Initially, the design added `multi.locations` and `default.write.location` as 
table properties. However, on further consideration, it seems more efficient to 
move these filesystem options to the catalog itself and modify the catalog to 
support multiple warehouses. This way, the table properties would only need to 
include `data-file.external-path`. If `data-file.external-path` is not empty, 
newly written data will go to the storage it specifies by default.

Additionally, when migrating hot and cold data later, users could select the 
destination address for the migration based on the filesystem options provided 
in the catalog.

I will implement POC code based on this new design and share it with the team 
for feedback.

Best,
Houliang



---- Replied Message ----
| From | Jingsong Li<jingsongl...@gmail.com> |
| Date | 12/13/2024 14:56 |
| To | <dev@paimon.apache.org> |
| Subject | Re: [DISCUSS] Introduce Table Multi-Location Management |
Hi Houliang,

Thanks for starting this discussion.

Maybe we can just introduce a single option, `data-file.external-path`? I
don't see the use of `multi.locations`.

In DataFileMeta, yes, we need to add another field: external_path.

As for FileIO, I think you can implement your own hybrid FileIO created
from catalog options.
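
A minimal sketch of what such a hybrid FileIO could look like, assuming it
simply dispatches on the URI scheme of each path. `SchemeDispatchingFileIO`
and the nested `SimpleFileIO` interface are hypothetical stand-ins; Paimon's
real FileIO interface is much larger, and the factories here would in practice
be built from catalog options.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; names and structure are assumptions, not Paimon's API.
final class SchemeDispatchingFileIO {

    /** Tiny stand-in for a per-filesystem client. */
    interface SimpleFileIO {
        String name();
    }

    private final Map<String, Supplier<SimpleFileIO>> factories = new HashMap<>();
    private final Map<String, SimpleFileIO> cache = new HashMap<>();

    /** Registers how to create the client for one scheme, e.g. "hdfs" or "s3". */
    void register(String scheme, Supplier<SimpleFileIO> factory) {
        factories.put(scheme.toLowerCase(Locale.ROOT), factory);
    }

    /** Picks the client by the path's scheme, creating it lazily and reusing it afterwards. */
    SimpleFileIO forPath(String path) {
        String scheme = URI.create(path).getScheme();
        // Paths without a scheme are treated as local files in this sketch.
        String key = scheme == null ? "file" : scheme.toLowerCase(Locale.ROOT);
        Supplier<SimpleFileIO> factory = factories.get(key);
        if (factory == null) {
            throw new IllegalArgumentException("No FileIO registered for scheme: " + key);
        }
        return cache.computeIfAbsent(key, k -> factory.get());
    }

    public static void main(String[] args) {
        SchemeDispatchingFileIO io = new SchemeDispatchingFileIO();
        // Each supplier returns a SimpleFileIO, itself written as a lambda for name().
        io.register("hdfs", () -> () -> "hdfs-io");
        io.register("s3", () -> () -> "s3-io");

        System.out.println(io.forPath("hdfs://nn:8020/warehouse/db1.db/orders").name());
        System.out.println(io.forPath("s3://bucket/ext/db1.db/orders").name());
    }
}
```

With this shape, files under the warehouse and files under
`data-file.external-path` can be read through one entry point, as long as a
client is registered for each scheme in use.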

I think the general idea is fine, but we may need POC code to
gauge its complexity.

Best,
Jingsong

On Wed, Dec 11, 2024 at 7:15 PM Houliang Qi <neuyi...@163.com> wrote:

Hi Paimon devs,


I’d like to initiate a discussion: Introduce Table Multi-Location 
Management [1]. Currently, a table's data can only be persisted in the 
catalog's warehouse path, which cannot be modified once the table is created. 
However, users may wish to store a table's data on different storage devices, 
or even store different partitions of a table on different storage devices 
based on how actively they are accessed. So the topic of this proposal is how 
to enable Paimon to support multi-location management for a single table.


Any opinions are welcome. Looking forward to your feedback. Thanks.


[1] 
https://docs.google.com/document/d/1NhmOyxM16QmY_rVb3KJtCKRrU_nogIJv532U59qW7EI/edit?tab=t.0#heading=h.xlrl29nlxwpo


