Thanks for the contribution. This sounds very interesting. A few comments.

* fileio.MatchFiles('hdfs://path/to/*.zip') | fileio.ExtractMatches() | fileio.MatchAll()

We usually do either 'fileio.MatchFiles('hdfs://path/to/*.zip')' or
'fileio.MatchAll()': the former to read a specific glob, the latter to read
a PCollection of globs. We also have support for reading compressed files.
We should extend that API instead of using both.

* ArchiveSystem with list() and extract().

Is this something we can add to the existing FileSystems abstraction
instead of introducing a new abstraction?
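As a concrete reference point, a minimal sketch of the proposed interface with a zip implementation might look like the following. The class and method names are assumptions taken from the proposal, not existing Beam code; the zip implementation also shows the random-access property mentioned below (the central directory lets us read one member without decompressing the rest):

```python
import abc
import zipfile


class ArchiveSystem(abc.ABC):
    """Hypothetical interface from the proposal (names are assumptions)."""

    @abc.abstractmethod
    def list(self, archive_path):
        """Return the member names inside the archive."""

    @abc.abstractmethod
    def extract(self, archive_path, member):
        """Return the bytes of a single member."""


class ZipArchiveSystem(ArchiveSystem):
    """Zip supports random access via its central directory, so a
    single member can be read without scanning the whole file."""

    def list(self, archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            return zf.namelist()

    def extract(self, archive_path, member):
        with zipfile.ZipFile(archive_path) as zf:
            with zf.open(member) as f:
                return f.read()
```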

* fileio.CompressMatches / fileio.WriteToArchive

Is this scalable for a distributed system? Usually we write one file per
bundle.
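To make the file-per-bundle point concrete, here is a stdlib-only sketch of that pattern applied to archives: each bundle writes its own zip shard, so no two workers ever append to the same archive. The function and shard-naming scheme are hypothetical, for illustration only:

```python
import os
import zipfile


def write_bundle_archive(out_dir, bundle_id, records):
    """Hypothetical per-bundle writer: each bundle produces its own
    zip shard (records are (member_name, payload) pairs), avoiding
    concurrent appends to a single output archive."""
    shard = os.path.join(out_dir, "output-%05d.zip" % bundle_id)
    with zipfile.ZipFile(shard, "w") as zf:
        for name, payload in records:
            zf.writestr(name, payload)
    return shard
```

A single logical archive output would then be a collection of such shards, which is what makes the write scalable.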

I suggest writing a doc with some background research on how other data
processing systems achieve this functionality, so that we can determine
whether the functionality can be added to the existing API.

Thanks,
Cham

On Wed, May 27, 2020 at 9:10 AM Ashwin Ramaswami <[email protected]>
wrote:

> I have a requirement where I need to read from / write to archive files
> (such as .tar, .zip). Essentially, I'd like to treat the entire .zip file I
> read from as a filesystem, so that I can get only the files I need from
> within the archive. This is useful because some archive formats, such as
> .zip, allow random access (so one does not need to read the entire zip file
> in order to read a single file from it).
>
> I've made an issue outlining how this might be designed -- would
> appreciate any feedback or thoughts about how this might work!
> https://issues.apache.org/jira/browse/BEAM-10111
>
