As far as I remember, we didn't intend to write the ColumnMetaData at the end of the Column Chunk, so this might be a case of the spec being ambiguous.

Ed, are you referring to this illustration in the spec? By "Column 1 Chunk 1 + Column Metadata" I think I meant the chunk *and* its metadata, but not necessarily in that order, since metadata is intertwined with pages.

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"

On Mon, Jun 3, 2024 at 6:44 PM Gang Wu <ust...@gmail.com> wrote:

> > modifying the spec to state that the ColumnMetaData following
> > the chunk data is also optional
>
> +1 on this
>
> > adding language to the effect that if the value of file_offset is 0,
> > then no such metadata is present in the file.
>
> What about marking this as deprecated and discouraged to use it?
>
> Best,
> Gang
>
>
> On Tue, Jun 4, 2024 at 1:59 AM Ed Seidl <etse...@live.com> wrote:
>
> > Hi all,
> >
> > While investigating a parquet-java issue with the file_offset field in
> > ColumnChunk [1] I discovered that it appears parquet java does not (and
> > perhaps never did?) write a copy of the ColumnMetaData following the
> > column chunk data. This IMO violates the specification[2]. Instead,
> > parquet-java seems to exclusively use the "optional" copy in the footer.
> > Given that this issue has AFAICT never resulted in compatibility issues
> > with other parquet readers, I'm wondering if it's safe to assume no one
> > actually uses the mandated copy trailing the chunk data. In that case,
> > would it make sense to modify the specification to match the reality on
> > the ground? I would propose modifying the spec to state that the
> > ColumnMetaData following the chunk data is also optional. Given that the
> > file_offset field is required, I'd also propose adding language to the
> > effect that if the value of file_offset is 0, then no such metadata is
> > present in the file.
> >
> > Thoughts?
> >
> > Thanks,
> > Ed
> >
> > [1] https://issues.apache.org/jira/browse/PARQUET-2139
> > [2] https://github.com/apache/parquet-format?tab=readme-ov-file#file-format
> >
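
For anyone following along, here is a minimal sketch of how a reader could locate the File Metadata using only the trailer shown in the layout above (the 4-byte little-endian length followed by the closing "PAR1"). This is part of why readers can get by with the footer copy alone. The function name footer_metadata_range and the stand-in byte strings are made up for illustration; real metadata is Thrift-encoded and this sketch does not decode it.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"


def footer_metadata_range(data: bytes) -> Tuple[int, int]:
    """Return (start, length) of the serialized file metadata in `data`.

    Hypothetical helper: it relies only on the layout quoted above, i.e.
    leading magic, ..., file metadata, 4-byte LE length, trailing magic.
    """
    if len(data) < 12:
        raise ValueError("too short to contain both magic numbers and a length")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError('missing "PAR1" magic number')
    # The 4 bytes just before the trailing magic hold the metadata length.
    (length,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - length
    if start < 4:
        raise ValueError("metadata length is larger than the file allows")
    return start, length


# Stand-in payloads (assumptions, not real Parquet content).
chunks = b"column-chunks"
metadata = b"file-metadata"
blob = MAGIC + chunks + metadata + struct.pack("<I", len(metadata)) + MAGIC

start, length = footer_metadata_range(blob)
print(blob[start:start + length])
```

Nothing in this path depends on a ColumnMetaData copy trailing each chunk; the per-chunk offsets would come from the decoded footer.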