On Thu, May 28, 2020 at 9:34 AM Chamikara Jayalath <[email protected]>
wrote:
> Thanks for the contribution. This sounds very interesting. A few comments.
>
> * | fileio.MatchFiles('hdfs://path/to/*.zip') | fileio.ExtractMatches() |
> fileio.MatchAll()
>
> We usually do either 'fileio.MatchFiles('hdfs://path/to/*.zip')' or
> 'fileio.MatchAll()': the former reads a specific glob and the latter reads
> a PCollection of globs. We also have support for reading compressed files.
> We should add to that API instead of using both.
>
> * ArchiveSystem with list() and extract().
>
> Is this something we can add to the existing FileSystems abstraction
> instead of introducing a new abstraction?
>
+1
In particular, a path like
zip://hdfs://path/to/zip:glob/within/zip/*.txt could be handled by a new
zipfile filesystem that supports parallel reads and delegates to any other
filesystem. One could then write
    p | fileio.MatchFiles('hdfs://path/to/*.zip')
          # produces a PCollection of zip file paths
      | fileio.ExtractMatches()
          # produces a PCollection of zip file entries, using a zipfile
          # filesystem
      | fileio.ReadMatches()
          # actually reads the files. One could do a text read, or
          # whatever, here as well.
      | ...
Note that tar files do not support random access (or even listing without
reading the entire contents), so they are poorly suited for this.
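
To make the random-access point concrete, here is a rough sketch using only Python's standard zipfile module. It is not Beam API: match_zip_entries, read_zip_entry and build_demo_zip are made-up names for illustration, and the archive lives in memory rather than on HDFS.

```python
import fnmatch
import io
import zipfile


def match_zip_entries(data, pattern):
    """Return the names of zip entries whose path matches a glob pattern."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # The entry list comes from the central directory, which ZipFile
        # reads when the archive is opened; no entry is decompressed here.
        return [n for n in zf.namelist() if fnmatch.fnmatchcase(n, pattern)]


def read_zip_entry(data, name):
    """Decompress and return a single entry, leaving the others untouched."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(name)


def build_demo_zip():
    """Build a small in-memory archive to experiment with (demo data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("logs/a.txt", "alpha")
        zf.writestr("logs/b.txt", "beta")
        zf.writestr("data/c.csv", "1,2,3")
    return buf.getvalue()


if __name__ == "__main__":
    archive = build_demo_zip()
    for entry in match_zip_entries(archive, "logs/*.txt"):
        print(entry, read_zip_entry(archive, entry))
```

A zipfile filesystem along these lines would presumably do the listing step once per archive and then hand out (archive, entry) pairs that workers can read independently.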
> *
> fileio.CompressMatches
> fileio.WriteToArchive
>
> Is this scalable for a distributed system? Usually we write a file per
> bundle.
>
> I suggest writing a doc with some background research related to how other
> data processing systems achieve this functionality so that we can try to
> determine if the functionality can be added to the existing API somehow.
>
Yeah, zip files are not writable in parallel. One /could/ do the
compression in parallel, and then have a final "writer" that just does
concat (with the appropriate headers) to the final zipfile(s).
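
A minimal sketch of that idea, assuming plain Python rather than anything in Beam: each entry is raw-deflated independently (here on a thread pool), and a single serial writer concatenates the results with local headers, a central directory and an end-of-central-directory record. The names compress_entry, concat_zip and write_zip_in_parallel are hypothetical, the timestamp is a fixed placeholder, and there is no ZIP64 support, so entries and the archive must stay under 4 GiB.

```python
import io
import struct
import zipfile
import zlib
from concurrent.futures import ThreadPoolExecutor

DOS_TIME, DOS_DATE = 0, 0x21  # placeholder timestamp: 1980-01-01 00:00:00
UTF8_FLAG = 0x800             # general-purpose bit 11: names are UTF-8
DEFLATE = 8                   # zip compression method code for deflate


def compress_entry(item):
    """Parallel step: raw-deflate one entry and record its CRC and size."""
    name, data = item
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15: no zlib header
    body = deflater.compress(data) + deflater.flush()
    return name.encode("utf-8"), zlib.crc32(data) & 0xFFFFFFFF, len(data), body


def concat_zip(entries):
    """Serial step: concatenate pre-compressed entries, then the directory."""
    out = io.BytesIO()
    central = []
    for name, crc, size, body in entries:
        offset = out.tell()
        # Local file header, followed by the name and the compressed bytes.
        out.write(struct.pack("<IHHHHHIIIHH", 0x04034B50, 20, UTF8_FLAG,
                              DEFLATE, DOS_TIME, DOS_DATE, crc, len(body),
                              size, len(name), 0))
        out.write(name)
        out.write(body)
        # Matching central directory record pointing back at that offset.
        central.append(struct.pack("<IHHHHHHIIIHHHHHII", 0x02014B50, 20, 20,
                                   UTF8_FLAG, DEFLATE, DOS_TIME, DOS_DATE,
                                   crc, len(body), size, len(name),
                                   0, 0, 0, 0, 0, offset) + name)
    cd_offset = out.tell()
    cd = b"".join(central)
    out.write(cd)
    # End-of-central-directory record (single disk, no archive comment).
    out.write(struct.pack("<IHHHHIIH", 0x06054B50, 0, 0, len(central),
                          len(central), len(cd), cd_offset, 0))
    return out.getvalue()


def write_zip_in_parallel(files, max_workers=4):
    """Compress all entries concurrently, then assemble one zip serially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        compressed = list(pool.map(compress_entry, files.items()))
    return concat_zip(compressed)


if __name__ == "__main__":
    blob = write_zip_in_parallel({"a.txt": b"hello " * 1000, "b.txt": b"hi"})
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        print(zf.namelist(), zf.testzip())
```

In a pipeline the compression step would run per bundle and only the concat would be a single-worker stage, which still has to move all the compressed bytes through one writer.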
> On Wed, May 27, 2020 at 9:10 AM Ashwin Ramaswami <[email protected]>
> wrote:
>
>> I have a requirement where I need to read from / write to archive files
>> (such as .tar, .zip). Essentially, I'd like to treat the entire .zip file I
>> read from as a filesystem, so that I can only get the files I need that are
>> within the archive. This is useful, because some archive formats such as
>> .zip allow random access (so one does not need to read the entire zip file
>> in order to just read a single file from it).
>>
>> I've made an issue outlining how this might be designed -- would
>> appreciate any feedback or thoughts about how this might work!
>> https://issues.apache.org/jira/browse/BEAM-10111
>>
>