Julien, yes, I'm referring to the diagram, as well as the wording that
follows it:
"The file metadata contains the locations of all the column metadata
start locations. More details on what is contained in the metadata can
be found in the Thrift definition.
Metadata is written after the data to allow for single pass writing."
My (perhaps incorrect) understanding of that is that the ColumnMetaData
is serialized after each column chunk, with the file_offset in the
ColumnChunk pointing to it. This appears to be what arrow-cpp implements
(and perhaps arrow-rs as well). Even the wording in the Thrift file
seems to indicate the primary copy is not in the footer, with the
version in the footer being an optional duplicate.
So does "Column Metadata" in the diagram just refer to the page headers?
Cheers,
Ed
On 6/4/24 6:20 PM, Julien Le Dem wrote:
As far as I remember, we didn't intend to write the ColumnMetaData at the
end of the Column Chunk.
So this might be a case of the spec being ambiguous.
Ed, are you referring to this illustration in the spec?
I think that by "Column 1 Chunk 1 + Column Metadata" I meant the chunk *and*
its metadata, but not necessarily in this order, since metadata is
intertwined with pages.
4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"
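For reference, the tail of the layout above (file metadata, then a 4-byte little-endian length, then the magic number) can be located by reading backwards from the end of the file. The following is only a minimal sketch: `locate_footer` is a hypothetical name, it handles just the "PAR1" magic shown in the illustration, and it does not decode the Thrift-serialized metadata itself.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # magic number from the illustration above


def locate_footer(buf: bytes) -> Tuple[int, int]:
    """Return (start, length) of the serialized file metadata inside buf.

    Hypothetical helper for illustration; not part of any Parquet library.
    """
    # Smallest possible file: leading magic + 4-byte length + trailing magic.
    if len(buf) < 12:
        raise ValueError("too short to be a Parquet file")
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes before the trailing magic hold the metadata length,
    # stored little endian.
    (length,) = struct.unpack("<I", buf[-8:-4])
    start = len(buf) - 8 - length
    if start < 4:
        raise ValueError("metadata length exceeds file size")
    return start, length
```

Nothing in this sketch depends on where any per-chunk ColumnMetaData sits; a reader that works only from the footer never needs to know.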
On Mon, Jun 3, 2024 at 6:44 PM Gang Wu <ust...@gmail.com> wrote:
"modifying the spec to state that the ColumnMetaData following
the chunk data is also optional"
+1 on this.
"adding language to the effect that if the value of file_offset is 0,
then no such metadata is present in the file."
What about marking this as deprecated and discouraging its use?
Best,
Gang
On Tue, Jun 4, 2024 at 1:59 AM Ed Seidl <etse...@live.com> wrote:
Hi all,
While investigating a parquet-java issue with the file_offset field in
ColumnChunk [1], I discovered that parquet-java apparently does not (and
perhaps never did?) write a copy of the ColumnMetaData following the
column chunk data. This IMO violates the specification [2]. Instead,
parquet-java seems to exclusively use the "optional" copy in the footer.
Given that this issue has AFAICT never resulted in compatibility issues
with other parquet readers, I'm wondering if it's safe to assume no one
actually uses the mandated copy trailing the chunk data. In that case,
would it make sense to modify the specification to match the reality on
the ground? I would propose modifying the spec to state that the
ColumnMetaData following the chunk data is also optional. Given that the
file_offset field is required, I'd also propose adding language to the
effect that if the value of file_offset is 0, then no such metadata is
present in the file.
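On the reader side, the proposed rule could be as small as the sketch below. This is an illustration of the proposal only, not existing behaviour in any implementation: `trailing_metadata_offset` is a hypothetical helper, and its argument stands for the file_offset value read from a ColumnChunk.

```python
from typing import Optional


def trailing_metadata_offset(file_offset: int) -> Optional[int]:
    """Apply the proposed convention for ColumnChunk.file_offset.

    Hypothetical helper: returns None when file_offset is 0 (no
    ColumnMetaData follows the chunk data), otherwise the offset itself.
    """
    if file_offset < 0:
        raise ValueError("file_offset must be non-negative")
    # Proposed wording: a value of 0 means no trailing copy exists.
    return None if file_offset == 0 else file_offset
```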
Thoughts?
Thanks,
Ed
[1] https://issues.apache.org/jira/browse/PARQUET-2139
[2] https://github.com/apache/parquet-format?tab=readme-ov-file#file-format