I think the names of classes in the code can differ from how the spec
refers to the concepts, if the maintainers don't mind. In my mind, changing
the parquet.thrift file to use consistent terminology doesn't change the
spec, nor will it require (or prevent) implementations from changing their
internal class names.



On Wed, May 29, 2024 at 11:09 AM Gang Wu <ust...@gmail.com> wrote:

> Hi,
>
> I agree that row sounds clearer than record; however, we have a class
> RecordReader in parquet-cpp [1]. I'm not sure whether we need to rename
> it, and it is still considered an internal class.
>
> [1] https://github.com/apache/arrow/blob/4a2df663bc88c73b863e0c0036160f7f936574c2/cpp/src/parquet/column_reader.h#L312
>
> Best,
> Gang
>
> On Wed, May 29, 2024 at 8:44 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > I agree that "row" is a more widespread terminology while "record" can
> > be a bit head-scratching.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Wed, 29 May 2024 05:49:22 -0400
> > Andrew Lamb <andrewlam...@gmail.com>
> > wrote:
> > > In the context of my PR trying to encode the consensus that records
> > > can't span page boundaries[1], Antoine brought up the excellent
> > > point[2] that the format[3] seems to use the terms "records" and
> > > "rows" to refer to the same concept.
> > >
> > > I agree it would clarify the spec to use the same terminology
> > > throughout.
> > > Given there are several fields named `num_rows` I propose changing
> > > parquet.thrift to use the term "row" throughout.
> > >
> > > I can make another PR to do so if this seems like a good idea.
> > >
> > > Andrew
> > > (P.S. the PR[1] is still waiting on some more review and merging :pray:)
> > >
> > > [1] https://github.com/apache/parquet-format/pull/244
> > > [2] https://github.com/apache/parquet-format/pull/244#discussion_r1617320495
> > > [3] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
> > >
> >
> >
> >
> >
>
