martindurant commented on PR #9661:
URL: https://github.com/apache/arrow-rs/pull/9661#issuecomment-4199590013

   Some general comments on this effort, without details of the implementation.
   
   First, I think this is a GREAT idea, and something I wish I had had the time 
to start myself (the `kerchunk` project had aspirations for byte-range 
redirection in parquet). It is high time that the parquet format was given 
modern features beyond what a "layer over parquet" (i.e., iceberg) can do.
   
   - it should be mentioned somewhere that the idea of the PR was (partly?) 
inspired by the blockwise deduplication possible via Xet. I don't know of 
another storage system that works quite the same; IPFS has content-addressing 
at the file level, for instance.
   - the metadata files are not legal parquet files: they cannot be loaded 
because they refer to byte ranges that don't exist. This means that none of the 
data is accessible without the specific code in here, and that would pose a 
problem for adoption, I think. The metadata files essentially give the schema 
evolution in a similar way to iceberg.
   - I think the format of the .page files is literally the binary of each page 
(header + def + rep + compress(values)). That should be made clearer. These are 
also not valid parquet data by themselves.
   - page statistics are (I think) stored only in the pages, but skipping 
the load of a page file would be a great thing to be able to do, so it might 
make sense to surface these in a central place and even add other per-page 
information to allow skipping pages when reading.
   - I could imagine combining the latter two points: the parquet files _could_ 
include all the pages. Right now, the hash is the filename, but you could have 
the pages in a real parquet file and only need to store the hash->offset,size 
information to get the benefits of dedup, but still allow the data to be read 
directly. You would need to make use of the ColumnChunk.file_path value. At 
least, I think I see a possibility.
   - (aside) where there is a structure like list[record[required: field1, 
required: field2]], the def and rep levels for the two leaf fields must be 
identical, so there are other duplications in the data; the reader should not 
even need to load them a second time, since the offset/index arrays are the 
same.
   - dedup is even more important for remote storage. I realise you might be 
operating on locally mounted remotes, but direct interaction with remote 
storage and byte ranges I think should be considered. For instance, listing the 
3500 page files of the example pipeline would pose a significant runtime cost.
   - the idea here might work even better for the feather2 format and perhaps 
others. For feather2, in-file pointers/links are the norm (flatbuffers style) 
and instead of def and rep levels, you store the actual validity/index arrays, 
so they can be directly deduplicated separately. I think feather2 "chunk" sizes 
are probably more like row-groups than pages though.
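
   To make the hash->offset,size suggestion above concrete, here is a minimal 
sketch in plain Python. It is not the PR's implementation and not a parquet 
writer: `build_page_store`, `read_page`, and the use of SHA-256 are all 
hypothetical choices for illustration. It only shows that unique pages can sit 
in one contiguous file while a small index preserves the dedup benefit.

```python
import hashlib


def build_page_store(pages):
    """Pack unique pages into one blob and index them by content hash.

    Hypothetical helper: returns (blob, index, refs), where index maps
    hash -> (offset, size) and refs lists the hash of every input page
    in its original order.
    """
    blob = bytearray()
    index = {}
    refs = []
    for page in pages:
        digest = hashlib.sha256(page).hexdigest()
        if digest not in index:
            # First time this content is seen: append it and record its range.
            index[digest] = (len(blob), len(page))
            blob += page
        refs.append(digest)
    return bytes(blob), index, refs


def read_page(blob, index, digest):
    """Fetch one page by hash using a single byte-range slice."""
    offset, size = index[digest]
    return blob[offset:offset + size]


# Three logical pages, two of which have identical bytes.
pages = [b"page-A", b"page-B", b"page-A"]
blob, index, refs = build_page_store(pages)
restored = [read_page(blob, index, d) for d in refs]
```

   Against remote storage, each lookup becomes one byte-range request on a 
single object, with no need to list thousands of per-page files.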

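   The point about surfacing page statistics centrally can be sketched the 
same way. This assumes a hypothetical per-page record holding min/max values 
next to the page hash; `PageStats` and `pages_to_load` are illustrative names, 
not anything defined by the PR or by parquet.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical central record: one entry per page file."""
    page_hash: str
    min_value: int
    max_value: int


def pages_to_load(stats, lo, hi):
    """Return hashes of pages whose [min, max] range overlaps [lo, hi]."""
    return [
        s.page_hash
        for s in stats
        if s.max_value >= lo and s.min_value <= hi
    ]


stats = [
    PageStats("aaa", 0, 99),
    PageStats("bbb", 100, 199),
    PageStats("ccc", 200, 299),
]

# A filter such as 150 <= x <= 210 never needs to open page "aaa".
needed = pages_to_load(stats, 150, 210)
```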
