martindurant commented on PR #9661: URL: https://github.com/apache/arrow-rs/pull/9661#issuecomment-4199590013
Some general comments on this effort, without going into the details of the implementation.

First, I think this is a GREAT idea, and something I wish I had had the time to start myself (the `kerchunk` project had aspirations for byte-range redirection in parquet). It is high time that the parquet format was given modern features beyond what a "layer over parquet" (i.e., iceberg) can do.

- it should be mentioned somewhere that the idea of the PR was (partly?) inspired by the blockwise deduplication possible via Xet. I don't know of another storage system that works quite the same way; ipfs has content-addressing at the file level, for instance.
- the metadata files are not legal parquet files: they cannot be loaded because they refer to byte ranges that don't exist. This means that none of the data is accessible without the specific code in here, and that would pose a problem for adoption, I think. The metadata files essentially give the schema evolution in a similar way to iceberg.
- I think the format of the `.page` files is literally the binary of each page (header+def+rep+compress(values)). That should be made clearer. These are also not valid parquet data by themselves.
- page statistics are (I think) stored only in the pages, but skipping the load of a page file would be a great thing to be able to do, so it might make sense to surface these in a central place, and even to add other per-page information to allow skipping pages when reading.
- I could imagine combining the latter two points: the parquet files _could_ include all the pages. Right now, the hash is the filename, but you could have the pages in a real parquet file and only need to store the hash->offset,size information to get the benefits of dedup, while still allowing the data to be read directly. You would need to make use of the `ColumnChunk.file_path` value. At least, I think I see a possibility.
- (aside) where there is a structure like `list[record[required: field1, required: field2]]`, the def and rep levels for the two leaf fields must be identical, so there are other duplications in the data; the reader shouldn't even need to load them a second time, since the offset/index arrays are the same.
- dedup is even more important for remote storage. I realise you might be operating on locally mounted remotes, but I think direct interaction with remote storage and byte ranges should be considered. For instance, listing the 3500 page files of the example pipeline would pose a significant runtime cost.
- the idea here might work even better for the feather2 format, and perhaps others. For feather2, in-file pointers/links are the norm (flatbuffers style), and instead of def and rep levels you store the actual validity/index arrays, so they can be directly deduplicated separately. I think feather2 "chunk" sizes are probably more like row-groups than pages, though.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
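The hash->offset,size suggestion in the comment above can be illustrated with a small sketch. This is not the PR's implementation and does not produce real parquet: the page payloads are placeholder bytes, SHA-256 is an assumed hash function, and the names `build_page_store` and `read_page` are hypothetical. It only shows that storing each distinct page once in a single blob, plus a hash -> (offset, size) index, keeps the dedup benefit while every page remains readable by a plain byte-range read.

```python
import hashlib
import io


def build_page_store(pages):
    """Write each distinct page once into one blob.

    Returns (blob, index, refs), where index maps hash -> (offset, size)
    and refs holds one hash per input page, in the original order.
    """
    buf = io.BytesIO()
    index = {}
    refs = []
    for page in pages:
        # Assumption: SHA-256 as the content hash; the PR may use another.
        digest = hashlib.sha256(page).hexdigest()
        if digest not in index:
            offset = buf.tell()
            buf.write(page)
            index[digest] = (offset, len(page))
        refs.append(digest)
    return buf.getvalue(), index, refs


def read_page(blob, index, digest):
    """Fetch one page with a single byte-range read."""
    offset, size = index[digest]
    return blob[offset:offset + size]


# Placeholder payloads standing in for serialized pages; the first and
# third are identical, so only two pages are physically stored.
pages = [b"page-A" * 10, b"page-B" * 10, b"page-A" * 10]
blob, index, refs = build_page_store(pages)
restored = [read_page(blob, index, d) for d in refs]
```

With remote storage, `read_page` would correspond to one ranged GET against a single object, rather than a listing plus one request per page file.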
