Hi Gang,
Thanks - the historical context definitely makes sense, and I hear your
concern about breaking existing links. One thing I observed, though, is that
this choice also makes Parquet something of an outlier in this space.

For example, Iceberg's Table spec (https://iceberg.apache.org/spec/) and
Puffin (https://iceberg.apache.org/puffin-spec/) exist solely on the
website and not in a separate repo. Avro's spec (
https://avro.apache.org/docs/1.11.1/) is in the same situation. Arrow does
the same: https://arrow.apache.org/docs/format/Columnar.html in a versioned
way (last version: https://arrow.apache.org/docs/14.0/format/Columnar.html
).

ORC seems to have just recently (3 months ago) introduced an orc-format
repo, though their specs are also published in a versioned way on the
website: https://orc.apache.org/specification/ORCv0/,
https://orc.apache.org/specification/ORCv1/, and even their draft one:
https://orc.apache.org/specification/ORCv2/. It may be worth talking to
them about why they chose to do this.

Regarding parquet-format, I'm not suggesting that we outright remove it,
but I think there may be value in archiving the repo (so that it's read
only) and doing the work moving forward on the website, just as Iceberg and
Avro seem to do. It could also be a personal bias, but I think the website
offers a bit more flexibility and readability than navigating through
individual markdown files in the repo. We're also using Docsy as our
template (as it seems Avro is), so it shouldn't be too hard to adopt their
model.

Thanks, Vinoo


<vinoo.gan...@gmail.com>


On Tue, Mar 5, 2024 at 10:08 PM Gang Wu <ust...@gmail.com> wrote:

> Hi Vinoo,
>
> IMO, we cannot do this because the parquet-format repo serves as the
> dedicated place to hold the parquet specs, which includes the thrift
> definition file and a set of documents tagged for all versions. Some
> projects also directly reference the link of the markdown files, which will
> be broken if we remove the repo. Even for the deprecated Java code you
> mentioned
> above, I remember that someone told me the code may still be used by
> legacy projects. So it would not be easy to do such a move.
>
> Best,
> Gang
>
> On Wed, Mar 6, 2024 at 10:31 AM Vinoo Ganesh <vinoo.gan...@gmail.com>
> wrote:
>
>> Hi Parquet Dev -
>>
>> There have been some conversations about content stored on the
>> parquet-format GitHub repo vs. the website. Doing a cursory pass of the
>> parquet-format <https://github.com/apache/parquet-format> repo, it looks
>> like, other than the markdown documentation stored in the repo, most of the
>> core code was marked as deprecated here:
>> https://github.com/apache/parquet-format/pull/105, content was moved to
>> parquet-mr, and that entire repo really only exists to host this file:
>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift.
>> It's possible I'm missing something, but is my understanding correct?
>>
>> If so, would it make sense to just deprecate parquet-format as a repo,
>> move the content to be exclusively hosted on parquet-site
>> <https://github.com/apache/parquet-site/tree/asf-site>, and host the
>> thrift file elsewhere? This would solve the content duplication problem
>> between parquet format and the website, and would cut down on having to
>> manage a separate repo. I know there is benefit to having
>> comments/discussions on PRs or issues on the repo, but we could also pretty
>> easily port this to the site.
>>
>> I'm sure this proposal will elicit some strong responses, but wanted to
>> see if anyone had insights here / if I'm missing anything.
>>
>> Thanks, Vinoo
>>
>>
>> <vinoo.gan...@gmail.com>
>>
>
