Based on your use case, I don't think file join would be a suitable solution. 
There is a limit on the number of files that can be joined (about 2000), and 
the result would be an unusual file format (something like a tar file, but 
one needing special tools to access). It would also be very Lustre-specific. 

Instead, my recommendation would be to use an ext4 filesystem image to hold the 
many small files (either written into the image as they are created, if from a 
single client, or aggregated into it afterwards).  Later, this filesystem image 
could be mounted read-only on multiple clients for access. Also, the whole image 
file can be archived to tape efficiently (taking all small files with it, 
instead of keeping a stub in Lustre for each file).
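
A minimal sketch of the "aggregate after creation" variant, assuming e2fsprogs 
1.43 or newer so that "mkfs.ext4 -d" can populate the image without root access 
or a mount. The function name build_image and all paths are hypothetical, and 
the image size has to be chosen large enough for the data:

```shell
#!/bin/sh
# Sketch only: pack a directory of small files into a single ext4 image file.

# build_image SRC_DIR IMAGE SIZE
#   SRC_DIR  directory holding the small files (e.g. a zarr store)
#   IMAGE    image file to create, e.g. on the Lustre filesystem
#   SIZE     image size as accepted by truncate(1), e.g. 10G
build_image() {
    src_dir=$1
    image=$2
    size=$3

    # Create a sparse file of the requested size.
    truncate -s "$size" "$image" || return 1

    # Format it as ext4 and copy SRC_DIR into it in one step.
    #   -q  quiet
    #   -F  proceed although IMAGE is a regular file, not a block device
    #   -d  populate the new filesystem from SRC_DIR
    mkfs.ext4 -q -F -d "$src_dir" "$image"
}
```

Usage would look something like "build_image /scratch/run42.zarr 
/lustre/project/run42.img 10G" (hypothetical paths). The resulting single file 
is what gets striped, mounted, and archived.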

Loopback mounting of image files stored on Lustre already works today, but 
needs userspace help to create and mount/unmount them. There was a proposal, 
"Client Container Image (CCI)", for how this could be integrated directly into 
Lustre.  Please see my LUG presentation for details (maybe 2019 or so?)
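
The userspace side of mounting and unmounting could be sketched as below. This 
assumes root privileges on each client and an image that is no longer being 
written; the function names and paths are hypothetical:

```shell
#!/bin/sh
# Sketch only: loop-mount an ext4 image read-only on a client, and unmount it.

# mount_image_ro IMAGE MOUNTPOINT
mount_image_ro() {
    image=$1
    mountpoint=$2

    mkdir -p "$mountpoint" || return 1

    # loop: attach IMAGE to a free loop device automatically
    # ro:   mount read-only, so several clients can share the same image
    mount -t ext4 -o loop,ro "$image" "$mountpoint"
}

# unmount_image MOUNTPOINT
unmount_image() {
    # Unmounting also releases the loop device that mount set up.
    umount "$1"
}
```

For example, "mount_image_ro /lustre/project/run42.img /mnt/run42" on each 
client that needs access, followed by "unmount_image /mnt/run42" when done.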

Cheers, Andreas

> On Mar 30, 2023, at 00:47, Sven Willner <[email protected]> wrote:
> 
> Dear Patrick and Andreas,
> 
> Thank you very much for your quick and comprehensive replies.
> 
> My motivation behind this issue is the following:
> At my institute (research around a large earth system/climate model) we are 
> evaluating using zarr (https://zarr.readthedocs.io) for outputting large 
> multi-dimensional arrays. This currently results in a huge number of small 
> files as the responsibility of parallel writing is fully shifted to the file 
> system. However, after closing the respective datasets we could merge those 
> files again to reduce the metadata burden on the file system and for easier 
> archival if needed at a later point, ideally without copying the large amount 
> of data again. For read access I would simply create an appropriate 
> index/lookup table for the resulting large file - hence holes/gaps in the 
> file are not a problem as such.
> 
> As Patrick writes
>> Layout: 1 1 1 1 1 1 1 ... 20 MiB 2 2 2 2 2 2 .... 35 MiB
>> 
>> With data from 0-10 MiB and 20 - 30 MiB.
> that would be the resulting layout (I guess, minimizing holes could be 
> achieved by appropriate striping of the original files and/or a layout 
> adjustment during the merge, if possible).
> 
>> My expectation is that "join" of two files would be handled at the file EOF 
>> and *not* at the layout boundary.  Based on the original description from 
>> Sven, I'd think that small gaps in the file (e.g. 4KB for page alignment, 
>> 64KB for minimum layout alignment, or 1MB for stripe alignment) would be OK, 
>> but tens or hundreds of MB holes would be inefficient for processing.
> (Andreas)
> 
> Apart from archival, the resulting file would only be accessed locally within 
> the boundaries of the original smaller files, so I would expect the 
> performance costs of the gaps to be not that critical.
> 
>> while I think it is possible to implement this in Lustre, I'd have to ask 
>> what requirements are driving your request?  Is this just something you want 
>> to test, or is there some real-world usage demand for this (e.g. specific 
>> application workload, usage in some popular library, etc)?
> (Andreas)
> 
> At this stage I am just looking into possibilities to handle this situation - 
> I am neither an expert in zarr nor in Lustre.
> 
> If such a merge on the file system level turns out to be a route worth taking, 
> I would be happy to work on an implementation. However, yes, I would need 
> some guidance there. Also, at this point I cannot estimate the amount of work 
> needed even to test this approach.
> 
> Would the necessary layout manipulation be possible in userspace? (I will 
> have a look into the implementations of `lfs migrate` and `lfs mirror 
> extend`).
> 
> Thanks a lot!
> Best,
> Sven
> 
> On Wed, Mar 29, 2023 at 07:41:56PM +0000, Andreas Dilger wrote:
>> Patrick,
>> once upon a time there was "file join" functionality in Lustre that was 
>> ancient and complex, and was finally removed in 2009.  There are still a few 
>> remnants of this like "MDS_OPEN_JOIN_FILE" and "LOV_MAGIC_JOIN_V1" defined, 
>> but unused.   That functionality long predated composite file layouts (PFL, 
>> FLR), and used an external llog file *per file* to declare a series of other 
>> files that described the layout.  It was extremely fragile and complex and 
>> thankfully never got into widespread usage.
>> 
>> I think with the advent of composite file layout that it should be 
>> _possible_ to implement this kind of functionality purely with layout 
>> changes, similar to "lfs migrate" doing layout swap, or "lfs mirror extend" 
>> merging the layout of a victim file into another file to create a mirror.
>> 
>> My expectation is that "join" of two files would be handled at the file EOF 
>> and *not* at the layout boundary.  Based on the original description from 
>> Sven, I'd think that small gaps in the file (e.g. 4KB for page alignment, 
>> 64KB for minimum layout alignment, or 1MB for stripe alignment) would be OK, 
>> but tens or hundreds of MB holes would be inefficient for processing.
>> 
>> My guess, based on similar requests I've seen previously, and Sven's email 
>> address, is that this relates to merging video streams from different files 
>> into a single file?
>> 
>> Sven,
>> while I think it is possible to implement this in Lustre, I'd have to ask 
>> what requirements are driving your request?  Is this just something you want 
>> to test, or is there some real-world usage demand for this (e.g. specific 
>> application workload, usage in some popular library, etc)?
>> 
>> It seems possible to do this with layout manipulation similar to "lfs mirror 
>> extend -f" (i.e. a kind of "super file append" mechanism) but would be 
>> similarly destructive to the "victim" files appended to the original one, 
>> and would definitely not be something that could be done while the 
>> "original" file was actively in use.  Essentially, instead of "lfs mirror 
>> extend" just appending the victim layout to the existing file, it would need 
>> to also modify the original layout to truncate the layout at EOF, then 
>> offset the extent ranges in the victim layout by the current file size 
>> (rounded up to at least 64KB multiples, but preferably 1MB multiples to 
>> maintain RAID alignment).
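
The offset rounding described above can be illustrated with plain shell 
arithmetic. This is only a sketch of the calculation, not of any Lustre 
interface; the helper name align_up is hypothetical:

```shell
#!/bin/sh
# Sketch only: where a victim layout would start after the original file's EOF.

# align_up VALUE ALIGNMENT: round VALUE up to the next multiple of ALIGNMENT.
align_up() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

# Example: the original file is 10 MiB plus 1 byte long.
size=$(( 10 * 1024 * 1024 + 1 ))

align_up "$size" $(( 64 * 1024 ))     # 64 KiB alignment -> 10551296
align_up "$size" $(( 1024 * 1024 ))   # 1 MiB alignment  -> 11534336
```

With 1 MiB alignment the gap after a file is always smaller than 1 MiB, which 
stays in the "small gaps" range mentioned earlier.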
>> 
>> Is this something that you would be willing to work on with guidance for the 
>> implementation details, or a feature request that you hope someone else will 
>> implement?
>> 
>> Cheers, Andreas
>> 
>> On Mar 29, 2023, at 07:41, Patrick Farrell via lustre-discuss 
>> <[email protected]<[1]>> wrote:
>> 
>> Sven,
>> 
>> The "combining layouts without any data movement" part isn't currently 
>> possible.  It's probably possible in theory, but it's never been 
>> implemented.  (I'm curious what your use case is?)
>> 
>> Even allowing for data movement, there's no tool to do this for you.  
>> Depending what you mean by combining, it's possible to do this with Linux 
>> tools (see the end of my note), but you're going to have data copying.
>> 
>> It's a bit of an odd requirement, with some inherent questions - For 
>> example, file layouts generally go to infinity, because if they don't, you 
>> will get IO errors when you 'run off the end', ie, go past the defined 
>> layout, so the last component is usually defined to go to infinity.
>> 
>> That poses obvious questions when combining files.
>> 
>> If you're looking to combine files with layouts that do not go to infinity, 
>> then it's at least straightforward to see how you'd concatenate them.  But 
>> presumably the data in each file doesn't go to the very end of the layout?  
>> So do you want the empty parts of the layout included?
>> 
>> Say file 1 is 10 MiB in size but the layout goes to 20 MiB (again, layouts 
>> normally should go to infinity) and file 2 is also 10 MiB in size but the 
>> layout goes to, say, 15 MiB.  Should the result look like this?
>> 
>> Layout: 1 1 1 1 1 1 1 ... 20 MiB 2 2 2 2 2 2 .... 35 MiB
>> 
>> With data from 0-10 MiB and 20 - 30 MiB.
>> 
>> That's something you'd have to write a tool for, so it could write the data 
>> at your specified offset for putting in the second file (and third, etc...). 
>> You could also do something like:
>> 
>> lfs setstripe [your layout] combined_file; cat file1 > combined_file; 
>> truncate -s 20M combined_file (the end of the file 1 layout); 
>> cat file2 >> combined_file; etc.
>> 
>> So, you definitely can't avoid data copying here.  But that's how you could 
>> do it with simple Linux tools (which you could probably have drawn up 
>> yourself :)).
>> 
>> -Patrick
>> 
>> ________________________________
>> From: lustre-discuss <[email protected]<[2]>> on 
>> behalf of Sven Willner <[email protected]<[3]>>
>> Sent: Wednesday, March 29, 2023 7:58 AM
>> To: [email protected]<[1]> 
>> <[email protected]<[1]>>
>> Subject: [lustre-discuss] Joining files
>> 
>> Dear all,
>> 
>> I am looking for a way to join/merge/concatenate several files into one, 
>> whose layout is just the concatenation of the layouts of the respective 
>> files - ideally without any copying/moving on the data side (even if this 
>> would result in "holes" in the joined file).
>> 
>> I would very much appreciate any hints to tools or ideas of how to achieve 
>> such a join. As I understand it, there has been a `join` command for `lfs`, 
>> which is now deprecated (however, I am not sure if a use case like mine has 
>> been its purpose or why it has been deprecated).
>> 
>> Thanks a lot!
>> Best regards,
>> Sven
>> 
>> --
>> Dr. Sven Willner
>> Scientific Computing Lab (SCLab)
>> Max Planck Institute for Meteorology
>> Bundesstraße 53, D-20146 Hamburg, Germany
>> _______________________________________________
>> lustre-discuss mailing list
>> [email protected]<[1]>
>> [5]
>> 
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Principal Architect
>> Whamcloud
>> 
>> 
>> 
>> ----------------------
>> 
>> Links:
>> 
>> [1] mailto:[email protected]
>> [2] mailto:[email protected]
>> [3] mailto:[email protected]
>> [4] https://aka.ms/LearnAboutSenderIdentification
>> [5] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> -- 
> Dr. Sven Willner
> Scientific Computing Lab (SCLab)
> Max Planck Institute for Meteorology
> Bundesstraße 53, D-20146 Hamburg, Germany
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org