As far as I remember, we didn't intend to write the ColumnMetaData at the
end of the Column Chunk.
So this might be a case of the spec being ambiguous.
Ed, are you referring to this illustration in the spec?
I think that by "Column 1 Chunk 1 + Column Metadata" I meant the chunk *and*
its metadata, but not necessarily in that order, since the metadata is
intertwined with the pages.

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"
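
For what it's worth, the fixed parts of that layout (the two magic numbers
and the little-endian length) are enough to locate the file metadata without
relying on any metadata stored next to the chunks. A minimal sketch in
Python, where read_footer_bounds is a made-up helper name, and the file is
assumed to be unencrypted and already fully in memory:

```python
import struct

MAGIC = b"PAR1"


def read_footer_bounds(data):
    """Return (start, end) byte offsets of the serialized file metadata.

    Hypothetical helper for illustration only; it does not decode the
    Thrift-encoded metadata, it only finds where it lives in the file.
    """
    # Smallest possible file: leading magic + 4-byte length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at start or end of file")
    # The 4 bytes just before the trailing magic hold the metadata length,
    # stored as a little-endian unsigned integer.
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - meta_len
    # The metadata cannot overlap the leading magic number.
    if start < 4:
        raise ValueError("file metadata length exceeds file size")
    return start, end
```

Slicing data[start:end] then yields the bytes of the file metadata, which is
where the footer copy of each ColumnMetaData lives.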



On Mon, Jun 3, 2024 at 6:44 PM Gang Wu <ust...@gmail.com> wrote:

> > modifying the spec to state that the ColumnMetaData following
> > the chunk data is also optional
>
> +1 on this
>
> > adding language to the effect that if the value of file_offset is 0,
> > then no such metadata is present in the file.
>
> What about marking this as deprecated and discouraging its use?
>
> Best,
> Gang
>
>
> On Tue, Jun 4, 2024 at 1:59 AM Ed Seidl <etse...@live.com> wrote:
>
> > Hi all,
> > While investigating a parquet-java issue with the file_offset field in
> > ColumnChunk [1] I discovered that it appears parquet-java does not (and
> > perhaps never did?) write a copy of the ColumnMetaData following the
> > column chunk data. This IMO violates the specification[2]. Instead,
> > parquet-java seems to exclusively use the "optional" copy in the footer.
> > Given that this issue has AFAICT never resulted in compatibility issues
> > with other parquet readers, I'm wondering if it's safe to assume no one
> > actually uses the mandated copy trailing the chunk data. In that case,
> > would it make sense to modify the specification to match the reality on
> > the ground? I would propose modifying the spec to state that the
> > ColumnMetaData following the chunk data is also optional. Given that the
> > file_offset field is required, I'd also propose adding language to the
> > effect that if the value of file_offset is 0, then no such metadata is
> > present in the file.
> >
> > Thoughts?
> >
> > Thanks,
> > Ed
> >
> > [1] https://issues.apache.org/jira/browse/PARQUET-2139
> > [2] https://github.com/apache/parquet-format?tab=readme-ov-file#file-format
> >
>
