Re: [PR] PARQUET-2310: implementation status [parquet-site]
alamb commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2152291093 > > I think it would also help to define what is required of an implementation to have a "check" in the corresponding feature row. > > I worry about this becoming a little bit of a pedantic discussion. My 2 cents: I think a reasonable approach is to let maintainers decide on checking the box or not. Maybe we can have a ternary value. If a maintainer feels it is fully supported it is a check. If they think reasonable people might be confused it gets '-' or a check with a footnote to explain the exception. If it is completely not supported it gets an 'X'. We can always adjust criteria as we get feedback. I agree that it would be best to start with a relatively lax, low barrier to entry for self reporting. Over time we can add more stringency (ideally with automated checks) if/when that would add value. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] PARQUET-2310: implementation status [parquet-site]
emkornfield commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2151393203 > I think it would also help to define what is required of an implementation to have a "check" in the corresponding feature row. I worry about this becoming a little bit of a pedantic discussion. My 2 cents: I think a reasonable approach is to let maintainers decide on checking the box or not. Maybe we can have a ternary value. If a maintainer feels it is fully supported it is a check. If they think reasonable people might be confused it gets '-' or a check with a footnote to explain the exception. If it is completely not supported it gets an 'X'. We can always adjust criteria as we get feedback.
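A minimal sketch of the ternary value proposed in the comment above, assuming a Python helper that renders one cell of the implementation status table. The names `SupportStatus` and `render_cell`, the "✓" symbol, and the `[n]` footnote marker are hypothetical and do not come from the PR; only the three-way split (check, '-', 'X') and the idea of a footnote for exceptions are taken from the discussion.

```python
from enum import Enum
from typing import Optional


class SupportStatus(Enum):
    """Hypothetical three-valued status for one feature in one implementation."""

    SUPPORTED = "✓"     # maintainer considers the feature fully supported
    PARTIAL = "-"       # supported with caveats that could confuse readers
    UNSUPPORTED = "X"   # not supported at all


def render_cell(status: SupportStatus, footnote: Optional[int] = None) -> str:
    """Render one table cell, appending a footnote marker when one is given.

    The "[n]" marker format is an assumption made for this sketch.
    """
    if footnote is None:
        return status.value
    return f"{status.value} [{footnote}]"


# Example cells for a hypothetical feature row.
full = render_cell(SupportStatus.SUPPORTED)
caveat = render_cell(SupportStatus.SUPPORTED, footnote=1)
partial = render_cell(SupportStatus.PARTIAL, footnote=2)
missing = render_cell(SupportStatus.UNSUPPORTED)
```

Keeping the status self-reported by maintainers, as suggested in the thread, means nothing in this sketch validates the claim itself; stricter automated checks could be layered on later.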
[PR] DRAFT: PARQUET-2489: Strawman proposal for releases [parquet-site]
emkornfield opened a new pull request, #61: URL: https://github.com/apache/parquet-site/pull/61 (no comment)
Re: [PR] PARQUET-2310: implementation status [parquet-site]
alippai commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147981789 I like both!
Re: [PR] PARQUET-2310: implementation status [parquet-site]
alamb commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147973327 > Totally agree, thanks for the guide. What do you think about the non-Apache or other projects? (Duckdb, fastparquet, impala, cudf) Echoing what @pitrou said, I suggest we add a note somewhere that "this table includes any open source currently maintained implementation of parquet whose maintainers have helped fill it out. If you wish to add a new implementation to this table, please open a PR to do so". It might also be worth adding a column with some sort of placeholder (`?` for example) for those implementations (as a way of encouraging their help). However, that might be a good thing to do as a follow on PR as well.
Re: [PR] PARQUET-2310: implementation status [parquet-site]
pitrou commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147379889 IMHO, any currently maintained open source implementation of Parquet deserves mentioning there. But that also requires involvement from their respective maintainers (we shouldn't expect us Parquet maintainers to make sure the information is up to date).
Re: [PR] PARQUET-2310: implementation status [parquet-site]
alippai commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147369156 Totally agree, thanks for the guide. What do you think about the non-Apache or other projects? (Duckdb, fastparquet, impala, cudf)
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
emkornfield merged PR #60: URL: https://github.com/apache/parquet-site/pull/60
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
emkornfield commented on PR #60: URL: https://github.com/apache/parquet-site/pull/60#issuecomment-2125731630 LGTM, thank you @vinooganesh
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
alamb commented on code in PR #60: URL: https://github.com/apache/parquet-site/pull/60#discussion_r1610633755 ## content/en/_index.html: ## @@ -24,8 +24,8 @@ {{% /blocks/feature %}} -{{% blocks/feature icon="fab fa-github" title="Contributions welcome!" url="https://github.com/apache/parquet-mr; %}} -We do a [Pull Request](https://github.com/apache/parquet-mr/pulls) contributions workflow on **GitHub**. New users are always welcome! +{{% blocks/feature icon="fab fa-github" title="Contributions welcome!" url="https://github.com/apache/parquet-java; %}} Review Comment: I think it is something to revisit as a follow on PR
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
alamb commented on code in PR #60: URL: https://github.com/apache/parquet-site/pull/60#discussion_r1610633516 ## README.md: ## @@ -63,7 +63,7 @@ You can now preview the site locally on http://localhost:1313/ To create documentation for a new release of `parquet-format` create a new .md file under `content/en/blog/parquet-format`. Please see existing files in that directory as an example. -To create documentation for a new release of `parquet-mr` create a new .md file under `content/en/blog/parquet-mr`. Please see existing files in that directory as an example. +To create documentation for a new release of `parquet-java` create a new .md file under `content/en/blog/parquet-java`. Please see existing files in that directory as an example. Review Comment: I don't feel strongly
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
vinooganesh commented on code in PR #60: URL: https://github.com/apache/parquet-site/pull/60#discussion_r1609027971 ## content/en/_index.html: ## @@ -24,8 +24,8 @@ {{% /blocks/feature %}} -{{% blocks/feature icon="fab fa-github" title="Contributions welcome!" url="https://github.com/apache/parquet-mr; %}} -We do a [Pull Request](https://github.com/apache/parquet-mr/pulls) contributions workflow on **GitHub**. New users are always welcome! +{{% blocks/feature icon="fab fa-github" title="Contributions welcome!" url="https://github.com/apache/parquet-java; %}} Review Comment: That's a great point. This was actually what I was torn about the most when I first built the new site. I figured more people would want to contribute to `mr` than `format` (there's actual code in the former), so I went with `mr` everywhere. More than happy to revisit this as it was mostly just a guess on my part.
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
vinooganesh commented on code in PR #60: URL: https://github.com/apache/parquet-site/pull/60#discussion_r1609020796 ## README.md: ## @@ -63,7 +63,7 @@ You can now preview the site locally on http://localhost:1313/ To create documentation for a new release of `parquet-format` create a new .md file under `content/en/blog/parquet-format`. Please see existing files in that directory as an example. -To create documentation for a new release of `parquet-mr` create a new .md file under `content/en/blog/parquet-mr`. Please see existing files in that directory as an example. +To create documentation for a new release of `parquet-java` create a new .md file under `content/en/blog/parquet-java`. Please see existing files in that directory as an example. Review Comment: Ah I see the confusion - these notes have to do with updating the website to announce the new release: https://parquet.apache.org/blog/. So the flow would be 1. Make a release of parquet-java in that repo 2. Put up a blog post entry on the website containing the release information. Happy to remove this if folks feel strongly - but was thinking it may be good to have some instructions on how to actually make the post.
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
vinooganesh commented on PR #60: URL: https://github.com/apache/parquet-site/pull/60#issuecomment-2123541356 Thanks @alamb! > Should we also do a sweep and update the contribution guidelines to include the new ways to contribute? > I am not sure what this is asking Sorry -- this is a typo on my side. I meant to include a new contribution template (edited the message above). It was a response to this thread: https://lists.apache.org/thread/5oohcx3m16kqs8dmtl3vm1cgd8z0q10b. It's probably worth having separate release announcement templates for `parquet-format` and `parquet-java`.
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
alamb commented on code in PR #60: URL: https://github.com/apache/parquet-site/pull/60#discussion_r1608995912 ## content/en/_index.html: ## @@ -24,8 +24,8 @@ {{% /blocks/feature %}} -{{% blocks/feature icon="fab fa-github" title="Contributions welcome!" url="https://github.com/apache/parquet-mr; %}} -We do a [Pull Request](https://github.com/apache/parquet-mr/pulls) contributions workflow on **GitHub**. New users are always welcome! +{{% blocks/feature icon="fab fa-github" title="Contributions welcome!" url="https://github.com/apache/parquet-java; %}} Review Comment: As part of another PR perhaps we should revisit this link (perhaps it should link to parquet-format?) as again linking to the java implementation from the homepage might be more confusing than helpful ## README.md: ## @@ -63,7 +63,7 @@ You can now preview the site locally on http://localhost:1313/ To create documentation for a new release of `parquet-format` create a new .md file under `content/en/blog/parquet-format`. Please see existing files in that directory as an example. -To create documentation for a new release of `parquet-mr` create a new .md file under `content/en/blog/parquet-mr`. Please see existing files in that directory as an example. +To create documentation for a new release of `parquet-java` create a new .md file under `content/en/blog/parquet-java`. Please see existing files in that directory as an example. Review Comment: I would personally suggest moving the discussion about the release of `parquet-mr`/ `parquet-java` to that repo. It seems confusing to have instructions on how to do a release from another repo in `parquet-site` ## content/en/docs/Overview/_index.md: ## @@ -18,14 +18,14 @@ The parquet-format repository hosts the official specification of the Apache Par As a repository focused on specification, the parquet-format repository does not contain source code. 
-### parquet-mr +### parquet-java Review Comment: I agree adding a note like this would be clearer ``` The parquet-java repository(previously named `parquet-mr`) is part of the Apache Parquet project and specifically focuses on providing Java tools for ```
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
alamb commented on PR #60: URL: https://github.com/apache/parquet-site/pull/60#issuecomment-2123503469 > @alamb @wgtmac I put a very basic PR together to update _some_ of the references on the website from `parquet-mr` to `parquet-java`. I only chose to do some because I think we have a few questions to figure out first: > > 1. Are we going to change the published artifact name of `parquet-mr` to `parquet-java` or do we just want to keep publishing under mr? I personally suggest not making this change unless there is a compelling use case. It seems like it doesn't hurt to leave the artifacts as parquet-mr and would only cause downstream pain to update them now for very little gain > 2. Do we want to actually "rewrite history" and update the past references (contributions, etc..) in the docs to refer to `parquet-java` instead? I'm not a fan of rewriting history but figured I'd start a conversation just in case people want to. I recommend against doing this, again on the justification of "what benefit would we get from it"? > 3. Should we also do a sweep and update the contribution guidelines to include the new ways to contribute? I am not sure what this is asking > 4. Should we introduce a new section of the blog called `parquet-java` (I had been hacking using the blog for releases) to add a note (assuming we change the name of the artifact) that things have changed? Maybe we could create a blog post announcing some of the recent changes / activity (e.g. discussions on V3 format, clarifications on repos, new website, etc).
Re: [PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
emkornfield commented on code in PR #60: URL: https://github.com/apache/parquet-site/pull/60#discussion_r1608936158 ## content/en/docs/Overview/_index.md: ## @@ -18,14 +18,14 @@ The parquet-format repository hosts the official specification of the Apache Par As a repository focused on specification, the parquet-format repository does not contain source code. -### parquet-mr +### parquet-java Review Comment: maybe note here that this was previously referred to as parquet-mr due to the name of the repository (which has also been moved)?
[PR] Updating refereces from parquet-mr -> parquet-java [parquet-site]
vinooganesh opened a new pull request, #60: URL: https://github.com/apache/parquet-site/pull/60 @alamb @wgtmac I put a very basic PR together to update *some* of the references on the website from `parquet-mr` to `parquet-java`. I only chose to do some because I think we have a few questions to figure out first: 1. Are we going to change the published artifact name of `parquet-mr` to `parquet-java` or do we just want to keep publishing under mr? 2. Do we want to actually "rewrite history" and update the past references (contributions, etc..) in the docs to refer to `parquet-java` instead? I'm not a fan of rewriting history but figured I'd start a conversation just in case people want to. 3. Should we also do a sweep and update the contribution guidelines to include the new ways to contribute? 4. Should we introduce a new section of the blog called `parquet-java` (I had been hacking using the blog for releases) to add a note (assuming we change the name of the artifact) that things have changed?
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
julienledem commented on PR #59: URL: https://github.com/apache/parquet-site/pull/59#issuecomment-2116406982 Thanks!
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
alamb commented on PR #59: URL: https://github.com/apache/parquet-site/pull/59#issuecomment-2115741244 Thanks @wgtmac
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
wgtmac commented on PR #59: URL: https://github.com/apache/parquet-site/pull/59#issuecomment-2115380985 Let me merge this. Thanks everyone!
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
wgtmac merged PR #59: URL: https://github.com/apache/parquet-site/pull/59
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
vinooganesh commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1602060617 ## content/en/docs/Overview/_index.md: ## @@ -6,11 +6,11 @@ description: > All about Parquet. --- -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. +Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. +It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools. Review Comment: No really strong feelings, was just wondering if there was a subtextual focus intended
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
alamb commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1602019285 ## content/en/docs/Overview/_index.md: ## @@ -6,11 +6,11 @@ description: > All about Parquet. --- -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. +Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. +It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools. Review Comment: I didn't mean for the comma or lack thereof to carry any additional semantic meaning. I am happy to put a comma there if you like
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
vinooganesh commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1601981340 ## content/en/docs/Overview/_index.md: ## @@ -6,11 +6,11 @@ description: > All about Parquet. --- -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. +Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. +It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools. Review Comment: Did we mean for this to say "high performance compression" or is it "high performance, compression"? I think it may be the latter. Or maybe "It provides performant compression and encoding schemes..." I was thinking the first versions sound too much like the compression tool rather than the format
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
alamb commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1601975186 ## content/en/_index.md: ## @@ -9,7 +9,10 @@ title: Parquet Download -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + Review Comment: Text is updated
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1601930883 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,30 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### Parquet Format + +The "Parquet Format" project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### Parquet MR + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + + Parquet MR can be thought of the a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md). Review Comment: To follow up -- we are discussing reference implementations on the mailing list: https://lists.apache.org/thread/f9379yx0lf5gtpkgyv922pvowtzy4kmm
Re: [PR] Remove staging [parquet-site]
alamb commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2112691416 > Ah okay, so it does seem like we need to clear out the output directory of `asf-staging`: https://parquet.staged.apache.org/ I suggest we make a PR that directly targets that branch that leaves a pointer to https://parquet.apache.org/ for anyone who stumbles on it
Re: [PR] Remove staging [parquet-site]
vinooganesh commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2112682297 Ah okay, so it does seem like we need to clear out the output directory of `asf-staging`: https://parquet.staged.apache.org/
Re: [PR] Remove staging [parquet-site]
wgtmac merged PR #58: URL: https://github.com/apache/parquet-site/pull/58
Re: [PR] Remove staging [parquet-site]
wgtmac commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2112664342 Sure, let me merge this.
Re: [PR] Remove staging [parquet-site]
vinooganesh commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2112458273 I think we can merge as is, if only to see what the behavior will be. It seems pretty low risk, especially since we still have the branches and can revert if needed.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
vinooganesh commented on PR #53: URL: https://github.com/apache/parquet-site/pull/53#issuecomment-2112434241 Thanks everyone!
Re: [PR] Remove staging [parquet-site]
wgtmac commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2112333016 > > I think I can delete the staging branch. Before that, should we send a notice to the dev ML in case there is any objection? Maybe we can set a deadline and proceed after that. I don't think this is worth a formal vote. > > I agree -- either an email or JIRA ticket is probably good to record the rationale and decision in a more easily discoverable location > > > BTW, should we do anything to remove the site: https://parquet.staged.apache.org/? Or will it be removed automatically after we are done? > > I suggest we make one final PR to the staging branch to have it push some sort of notice or redirect so https://parquet.staged.apache.org/ redirects to https://parquet.apache.org/ > > I don't know how to remove it Do we want to make additional changes, or is it good to merge as is? @vinooganesh @alamb I can try to delete the staging-related branches once you see fit.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
wgtmac commented on PR #53: URL: https://github.com/apache/parquet-site/pull/53#issuecomment-2112328673 I just merged it. Thanks everyone!
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
wgtmac merged PR #53: URL: https://github.com/apache/parquet-site/pull/53
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on PR #53: URL: https://github.com/apache/parquet-site/pull/53#issuecomment-2112312510 I think we should merge this PR and begin working on the next steps (feature compatibility matrix) This is quite an impressive list of ✅ ![Screenshot 2024-05-15 at 7 44 35 AM](https://github.com/apache/parquet-site/assets/490673/690e6b2b-85e7-4253-abde-3cdae98eeb17)
Re: [PR] Remove staging [parquet-site]
wgtmac commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2111443901 cc @gszadovszky @julienledem
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1599985436 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. + +Included in parquet-mr: +* Java Implementation: It contains the core Java implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. 
+ +* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion. + + +### Other Clients / Libraries / Tools + +The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. It's important to note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to conduct thorough testing to ensure compatibility and performance across different platforms and tools. + +Here is a non-exhaustive list of Parquet implementations: + +* [parquet-mr](https://github.com/apache/parquet-mr) +* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) +* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) Review Comment: FYI this work is tracked by https://issues.apache.org/jira/browse/PARQUET-2310 and there is a draft at https://github.com/apache/parquet-site/pull/34. Once this PR gets merged we'll start working on that
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
crepererum commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599844055 ## content/en/docs/Overview/_index.md: ## @@ -6,4 +6,7 @@ description: > All about Parquet. --- -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. +Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. +It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. +Parquet is available in multiple languages including Java, C++, and Python. Review Comment: I think mentioning implementations (both as end-user software and as libs) is valuable, but it shouldn't be part of the elevator pitch. Other formats usually solve this with a dedicated sub-section or page, e.g.: - https://jpeg.org/jpegxl/software.html (the list format is good, the fact that there's only a single implementation is not) - https://paseto.io/ - https://autocrypt.org/dev-status.html This would also allow multiple implementations for a single language, which sometimes can be valuable (e.g. if you have a backwards compatible, conservative variant and a fancy new one).
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
alamb commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599769911 ## content/en/_index.md: ## @@ -9,7 +9,10 @@ title: Parquet Download -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + Review Comment: My plan is to wait another day or so for any additional comments, and if I don't hear any I will update this PR so that all three locations use the generic "Parquet is supported in many programming language and analytics tools." phrasing
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
amoeba commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599097041 ## content/en/_index.md: ## @@ -9,7 +9,10 @@ title: Parquet Download -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + Review Comment: SGTM
Re: [PR] Remove staging [parquet-site]
vinooganesh commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2108802214 I initially built this to publish to the `output/` dir per https://github.com/apache/infrastructure-asfyaml/tree/main?tab=readme-ov-file#pelican-sub-directories-for-static-output. see: https://github.com/apache/parquet-site/blob/staging/.github/workflows/deploy.yml#L46. We may want to publish an empty commit to the `output` dir of the `asf-staging` branch as just a sanity check cleanup too.
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
alamb commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599070791 ## content/en/_index.md: ## @@ -9,7 +9,10 @@ title: Parquet Download -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + Review Comment: I think the generic sentiment of wide support is good. I tried to remove the specific Java/C++/Go tech references which are on https://projects.apache.org/project.html?parquet (it seems to come from https://github.com/apache/parquet-site/blob/production/static/doap_Parquet.rdf). What do you think about changing all three locations (DOAP, this, and the overview @etseidl mentions in https://github.com/apache/parquet-site/pull/59/files#r1599056646) to use the more generic phrasing? > "Parquet is supported in many programming language and analytics tools."
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
alamb commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599066379 ## content/en/_index.md: ## @@ -9,7 +9,10 @@ title: Parquet Download -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. +It provides high performance data compression and encoding schemes with to handle complex data in bulk. Review Comment: nice catch -- I actually did this locally but forgot to push the change
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
amoeba commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599045106 ## content/en/_index.md: ## @@ -9,7 +9,10 @@ title: Parquet Download -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + Review Comment: At the risk of bike-shedding this (+1 as-is), I think it might be good to indicate that Parquet has readers/writers in many languages and in many tools. In the recent [IANA registration](https://www.iana.org/assignments/media-types/application/vnd.apache.parquet) I went with the vague "used across a wide variety of platforms, technologies, and environments.". But here maybe, > "Parquet is supported in many programming language and analytics tools." What do you think?
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
vinooganesh commented on PR #59: URL: https://github.com/apache/parquet-site/pull/59#issuecomment-2108729440 +1!
Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
alamb commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599034830 ## content/en/_index.md: ## @@ -9,7 +9,10 @@ title: Parquet Download -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + Review Comment: I wordsmithed the landing page a little to reduce its length and make it flow better. I can make it exactly mirror the DOAP text if reviewers prefer
[PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]
alamb opened a new pull request, #59: URL: https://github.com/apache/parquet-site/pull/59 # Rationale As described on https://issues.apache.org/jira/browse/PARQUET-2470, Parquet's role in the analytics ecosystem is substantial. However, https://parquet.apache.org/ currently emphasizes Parquet's role in the Hadoop ecosystem. I think this causes confusion in several ways: 1. It implies that Parquet is only focused on Hadoop, when I think it is a critical technology across other ecosystems that are unrelated to Hadoop (e.g. Apache Iceberg, Delta Lake, etc) 2. It may further the perception that the Apache Parquet project only focuses on / cares about Hadoop / Java implementation # Changes Update the home page content to mirror the Apache Project Description https://projects.apache.org/project.html?parquet (which does not mention Hadoop specifically) > Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, and Python. ## Before this PR ![Screenshot 2024-05-13 at 4 13 31 PM](https://github.com/apache/parquet-site/assets/490673/86a76878-f304-4d43-8156-a3555ccebfbc) ## After the PR ![Screenshot 2024-05-13 at 4 15 17 PM](https://github.com/apache/parquet-site/assets/490673/7479dd8f-3054-410e-9c14-4a8d2a0dccaa)
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1599015648 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. Review Comment: I started a thread on the mailing list about this topic to see if we can reach consensus: https://lists.apache.org/thread/f9379yx0lf5gtpkgyv922pvowtzy4kmm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
vinooganesh commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598594225 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,40 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) repositories. + + +### parquet-format + +The parquet-format repository hosts the official specification of the Apache Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. The "mr" stands for MapReduce. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Apache Parquet files. + +The parquet-mr repository contains an implementation of the Apache Parquet format. There are a number of other Parquet format implementations, which are listed below. + +Included in parquet-mr: +* Java Implementation: It contains the core Java implementation of the Apache Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. 
+ +* Utilities and APIs: It provides various utilities and APIs for working with Apache Parquet files, including tools for data import/export, schema management, and data conversion. + + +### Other Clients / Libraries / Tools + +The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. It's important to note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to conduct thorough testing to ensure compatibility and performance across different platforms and tools. + +Here is a non-exhaustive list of Parquet implementations: + +* [Parquet-mr](https://github.com/apache/parquet-mr) +* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) +* [Parquet Go, a subproject for Arrow Go](https://github.com/apache/arrow/tree/main/go/parquet) ([documentation](https://github.com/apache/arrow/tree/main/go)) +* [Parquet Rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) +* [cudf](https://github.com/rapidsai/cudf) Review Comment: Ahh, thanks @etseidl ! I didn't realize this was the stylized version
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
etseidl commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598588034 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,40 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) repositories. + + +### parquet-format + +The parquet-format repository hosts the official specification of the Apache Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. The "mr" stands for MapReduce. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Apache Parquet files. + +The parquet-mr repository contains an implementation of the Apache Parquet format. There are a number of other Parquet format implementations, which are listed below. + +Included in parquet-mr: +* Java Implementation: It contains the core Java implementation of the Apache Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. 
+ +* Utilities and APIs: It provides various utilities and APIs for working with Apache Parquet files, including tools for data import/export, schema management, and data conversion. + + +### Other Clients / Libraries / Tools + +The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. It's important to note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to conduct thorough testing to ensure compatibility and performance across different platforms and tools. + +Here is a non-exhaustive list of Parquet implementations: + +* [Parquet-mr](https://github.com/apache/parquet-mr) +* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) +* [Parquet Go, a subproject for Arrow Go](https://github.com/apache/arrow/tree/main/go/parquet) ([documentation](https://github.com/apache/arrow/tree/main/go)) +* [Parquet Rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) +* [cudf](https://github.com/rapidsai/cudf) Review Comment: ```suggestion * [cuDF](https://github.com/rapidsai/cudf) ```
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
vinooganesh commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598499413 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. Review Comment: cc @gszadovszky @Fokko @xhochy @wgtmac. I don't want to further block this PR by settling this beforehand, so I'm going to remove the word "reference" and we can add it back if we want to in a subsequent PR. -- This is an automated message from the Apache Git Service. 
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
vinooganesh commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598495733 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. Review Comment: @pitrou I assume it's mapreduce, but please correct me if I'm wrong -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
pitrou commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598406934 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. Review Comment: If we don't agree on what a reference implementation is, then we should not list parquet-mr as a reference implementation. The term "reference implementation" has an official connotation and implies a specific status; it certainly should not be assigned lightly.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
vinooganesh commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598331634 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. Review Comment: There have been a lot of conversations about this: https://github.com/apache/parquet-site/pull/53#discussion_r1582882267 (and others on the thread) and I'm inclined to keep this as is. I don't think we need to exhaustively list the other reference implementations when there is a list of implementations below. @gszadovszky has also called this a reference implementation and I think it helps clarify the relationship between `parquet-format` and `parquet-mr`. I'm more than happy to update this once the community has reached a consensus after the mailing list discussion that @alamb suggested though.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
vinooganesh commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598329089 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. Review Comment: I can make this change. These are referred to publicly as both projects and repo (in our mailing list as well) so I deliberately put both in. I'll stick with repository though. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598319137 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. Review Comment: If we knew what those reference implementations are, I agree it would be valuable to document them. However, I think consensus is required before we make such a determination. Thus, for this PR I suggest: 1. Remove the word "reference" 2. File a follow-on ticket / discussion in the mailing list to figure out what should be listed as reference implementations ```suggestion The parquet-mr repo contains an implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. ```
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598315092 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. + +Included in parquet-mr: +* Java Implementation: It contains the core Java implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. + +* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion. + + +### Other Clients / Libraries / Tools + +The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. It's important to note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to conduct thorough testing to ensure compatibility and performance across different platforms and tools. + +Here is a non-exhaustive list of Parquet implementations: + +* [parquet-mr](https://github.com/apache/parquet-mr) +* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) +* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) Review Comment: I recommend that (as a follow-on PR) we turn this list into a table, something like

| Project | Language | Website | API Docs |
|---|---|---|---|
| parquet-mr | Java | link | |
| parquet-cpp | C++ | link | link |
| parquet-rs | Rust | link | link |

Also, I recommend that the criterion for being listed here be "Open source implementations of the parquet format" (which is a low bar, to be sure). I would be happy to propose such changes as a follow-on PR.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on PR #53: URL: https://github.com/apache/parquet-site/pull/53#issuecomment-2107320484 Given the size and substance of this PR, it is unlikely we will get it perfect the first time. Also, I don't see any disagreement among commenters about the value of this information. I would personally suggest that we address all the outstanding comments as best as possible, merge this PR, and then iterate on the content in subsequent PRs.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
jorisvandenbossche commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598301378 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. + +Included in parquet-mr: +* Java Implementation: It contains the core Java implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. + +* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion. + + +### Other Clients / Libraries / Tools + +The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. It's important to note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to conduct thorough testing to ensure compatibility and performance across different platforms and tools. + +Here is a non-exhaustive list of Parquet implementations: + +* [parquet-mr](https://github.com/apache/parquet-mr) +* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) +* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) Review Comment: We could also mention for Go (and similarly for Rust below) that it is a subproject of Arrow Go, as is done for C++ above. It also seems there are several Parquet Go implementations, including others that are not part of the Arrow project, so it's good to clarify that this one is Arrow-related. At the same time, it's not entirely clear what the criterion is for being listed here.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
pitrou commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1598282487 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. Review Comment: Can we please make the terminology consistent? Either describe both parquet-format and parquet-mr as "projects" or as "GitHub repositories", but not one and the other. ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. 
+ +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. Review Comment: It's "a" reference implementation, it means there are other ones. But I don't see them listed here. Either list all reference implementations explicitly, or make this "the" reference implementation. ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. 
+ + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. Review Comment: Also, can we explain what "mr" stands for? It's a mystery for most people. ##
Re: [PR] Remove staging [parquet-site]
alamb commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2107096367 > I think I can delete the staging branch. Before that, should we send a notice to the dev ML in case there is any objection? Maybe we can set a deadline and proceed after that. I don't think this is worth a formal vote. I agree -- either an email or a JIRA ticket is probably good to record the rationale and decision in a more easily discoverable location. > BTW, should we do anything to remove the site: https://parquet.staged.apache.org/? Or will it be removed automatically after we are done? I suggest we make one final PR to the staging branch to have it push some sort of notice or redirect so that https://parquet.staged.apache.org/ redirects to https://parquet.apache.org/ (I don't know how to remove the staged site itself).
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
jorisvandenbossche commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1597944460 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. + +Included in parquet-mr: +* Java Implementation: It contains the core Java implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. 
+ +* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion. + + +### Other Clients / Libraries / Tools + +The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. It's important to note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to conduct thorough testing to ensure compatibility and performance across different platforms and tools. + +Here is a non-exhaustive list of Parquet implementations: + +* [parquet-mr](https://github.com/apache/parquet-mr) +* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) +* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) +* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) +* [cudf](https://github.com/rapidsai/cudf) +* [apache impala](https://github.com/apache/impala) +* [duckdb](https://github.com/duckdb/duckdb) +* [fast-parquet python](https://github.com/dask/fastparquet) +* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) Review Comment: ```suggestion ``` It's twice in the list -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
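A repeated entry like the one flagged above is easy to catch mechanically. The following is only a minimal sketch, assuming the list lives in a plain Markdown file (not a diff) and that each item is a link of the form `* [name](url)`; the function name `duplicate_links` is hypothetical and not part of any parquet-site tooling.

```python
import re
from collections import Counter
from typing import List

# Matches a Markdown bullet whose first element is a link: "* [text](url)".
LINK_ITEM = re.compile(r"^\s*[*+-]\s*\[([^\]]+)\]\(([^)\s]+)\)")


def duplicate_links(markdown: str) -> List[str]:
    """Return URLs that appear in more than one bullet item, in first-seen order."""
    urls = []
    for line in markdown.splitlines():
        match = LINK_ITEM.match(line)
        if match:
            urls.append(match.group(2))
    counts = Counter(urls)
    return [url for url, n in counts.items() if n > 1]


if __name__ == "__main__":
    sample = (
        "* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet)\n"
        "* [duckdb](https://github.com/duckdb/duckdb)\n"
        "* [parquet go](https://github.com/apache/arrow/tree/main/go/parquet)\n"
    )
    for url in duplicate_links(sample):
        print("listed more than once:", url)
```

The check compares only the first link on each bullet, so a trailing "(documentation)" link on the same line does not affect the result.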
Re: [PR] Remove staging [parquet-site]
wgtmac commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2106471367 I think I can delete the staging branch. Before that, should we send a notice to the dev ML in case there is any objection? Maybe we can set a deadline and proceed after that. I don't think this is worth a formal vote. BTW, should we do anything to remove the site: https://parquet.staged.apache.org? Or will it be removed automatically after we are done?
Re: [PR] Remove staging [parquet-site]
alamb commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2106397019
> The main thing I'm curious about is whether the PMC can delete branches easily from github. If so, it may be much more straightforward, otherwise will have to file INFRA tickets

In the arrow and datafusion repos, any committer can delete any branch other than the "protected" one (typically the main one). Thus I suspect someone like @wgtmac could do so in this repo.
Re: [PR] Remove staging [parquet-site]
vinooganesh commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2106379277
Yep, there is actually a sequence of things that need to happen here:

1. Deleting the `asf-staging` branch
2. Deleting the `staging` branch
3. Deleting the README from the production branch

The main thing I'm curious about is whether the PMC can delete branches easily from GitHub. If so, it may be much more straightforward; otherwise we will have to file INFRA tickets.
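The two branch deletions in the sequence listed in the comment above could be sketched as the following dry-run script. This is only an illustration, not what was actually run: it assumes the GitHub remote is named `origin`, the function names are hypothetical, and by default it only prints the `git push --delete` commands instead of executing them. The README step is not scripted here.

```shell
#!/bin/sh
# Dry-run sketch of the branch cleanup; function names are hypothetical.
# Assumption: the GitHub remote is called "origin".

delete_remote_branch() {
  mode="$1"
  branch="$2"
  if [ "$mode" = "run" ]; then
    # Actually delete the branch on the remote (requires push rights).
    git push origin --delete "$branch"
  else
    # Default: only show what would be executed.
    echo "git push origin --delete $branch"
  fi
}

cleanup_staging() {
  mode="${1:-dry-run}"
  delete_remote_branch "$mode" asf-staging   # step 1
  delete_remote_branch "$mode" staging       # step 2
  # Step 3 (the README change on the production branch) is not scripted here.
}

# Print the commands without touching the remote.
cleanup_staging
```

Calling `cleanup_staging run` would execute the deletions, which only works for someone with permission to delete branches in the repository.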
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on PR #53: URL: https://github.com/apache/parquet-site/pull/53#issuecomment-2106376534
> Thanks - and just to make sure it's clear, my main goal was to start the process of actually documenting the institutional knowledge in the community and this PR is mostly intended as a starting point. There are some other much meatier topics (parquet v2's definition for example) that will need to be discussed in follow up PRs.

I think documenting the current / institutional knowledge is super helpful. Thank you for pushing this forward.
Re: [PR] Remove staging [parquet-site]
alamb commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2106376324 We may also want to update the readme: https://github.com/apache/parquet-site/blob/production/README.md
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
vinooganesh commented on PR #53: URL: https://github.com/apache/parquet-site/pull/53#issuecomment-2106374201 Thanks - and just to make sure it's clear, my main goal was to start the process of actually documenting the institutional knowledge in the community and this PR is mostly intended as a starting point. There are some other much meatier topics (parquet v2's definition for example) that will need to be discussed in follow up PRs.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
vinooganesh commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1597713221 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,40 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. Review Comment: Will update -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Remove staging [parquet-site]
vinooganesh commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2106372596 cc @wgtmac @gszadovszky @alamb after conversation on https://github.com/apache/parquet-site/pull/56
[PR] Remove staging [parquet-site]
vinooganesh opened a new pull request, #58: URL: https://github.com/apache/parquet-site/pull/58 There still needs to be an infra ticket filed to actually delete the `staging` branch (unless a PMC member can delete the branch)
Re: [PR] Add Dockerfile + instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
alamb commented on PR #56: URL: https://github.com/apache/parquet-site/pull/56#issuecomment-2106180380 Thanks @wgtmac and @vinooganesh
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
wgtmac commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1597576971 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,40 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### Parquet Format + +The "Parquet Format" project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### Parquet-MR + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +Parquet-MR can be seen as a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, which are listed below. + +Included in parquet-mr: +* Java/Scala Implementation: It contains the core Java/Scala implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. Review Comment: Perhaps we should just say Java implementation here. The scala code is just for filters and we don't have a full scala implementation. 
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
xhochy commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1597576454 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,40 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. Review Comment: I would second the removal of the quotes here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
wgtmac commented on PR #53: URL: https://github.com/apache/parquet-site/pull/53#issuecomment-2106156421 @xhochy @pitrou @tustvold Would you like to take a final pass? I will merge it next week if there are no further comments.
Re: [PR] Add Dockerfile + instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
wgtmac merged PR #56: URL: https://github.com/apache/parquet-site/pull/56
Re: [PR] Add Dockerfile + instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
wgtmac commented on PR #56: URL: https://github.com/apache/parquet-site/pull/56#issuecomment-2106152954 For the staging site, I had a discussion with @gszadovszky here: https://github.com/apache/parquet-site/pull/31#issuecomment-1474023977. I think we can remove the staging site now and use the Dockerfile for debugging purposes.
Re: [PR] Update README.md on asf-site branch with pointer to real readme [parquet-site]
wgtmac merged PR #57: URL: https://github.com/apache/parquet-site/pull/57
Re: [PR] Update README.md on asf-site branch with pointer to real readme [parquet-site]
wgtmac commented on PR #57: URL: https://github.com/apache/parquet-site/pull/57#issuecomment-2106151363 Thanks Andrew for doing this!
Re: [PR] Add Dockerfile + instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
vinooganesh commented on PR #56: URL: https://github.com/apache/parquet-site/pull/56#issuecomment-2105999510 Good question @alamb. Technically the "best practice" from the docsy instructions was to create a staging website, so I mostly just followed it when I remade the parquet one. Back then, there was a lot of stuff to work through with hugo builds and migrating from the old jenkins site, so having a place to test was definitely helpful. At this point though, I don't think it's necessary to have the staging site anymore.
Re: [PR] Update README.md on asf-site branch with pointer to real readme [parquet-site]
alamb commented on PR #57: URL: https://github.com/apache/parquet-site/pull/57#issuecomment-2105997880 Thanks @vinooganesh -- I started a thread to discuss this on the mailing list https://lists.apache.org/thread/97g4zqlvobr9knntvsbghjs6v3gr63x2 (which is typically what INFRA likes to see when making changes such as the default branch)
Re: [PR] Add Dockerfile + instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
alamb commented on PR #56: URL: https://github.com/apache/parquet-site/pull/56#issuecomment-2105997211
> The other thing that we haven't been doing a good job of is maintaining the staging website. I made a bunch of changes to get the staging and production branch in sync, but staging still isn't heavily used.

I wonder what the use case for the staging website is? (maybe we should just not use it?) FWIW for https://arrow.apache.org/ and https://datafusion.apache.org/ we simply publish to the production version of the site. Sometimes the staging site might be helpful to host pre-release API docs or something, but I didn't see any on this site.
Re: [PR] Update README.md on asf-site branch with pointer to real readme [parquet-site]
vinooganesh commented on PR #57: URL: https://github.com/apache/parquet-site/pull/57#issuecomment-2105996926 +1, I think this will require an INFRA ticket. @shangxinli couldn't change it in the GitHub UI the last time we attempted to update it.
Re: [PR] Add Dockerfile + instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
vinooganesh commented on PR #56: URL: https://github.com/apache/parquet-site/pull/56#issuecomment-2105996596 This is a great suggestion and the timing is right. I spent some time a few weeks ago moving the parquet site's docsy dependency to a hugo module, so now they can be managed separately. The other thing that we haven't been doing a good job of is maintaining the staging website. I made a bunch of changes to get the `staging` and `production` branches in sync, but staging still isn't heavily used.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
vinooganesh commented on PR #53: URL: https://github.com/apache/parquet-site/pull/53#issuecomment-2105995733 @wgtmac - given consensus here, would you be able to merge?
Re: [I] Build failed [parquet-site]
vinooganesh closed issue #54: Build failed URL: https://github.com/apache/parquet-site/issues/54
[PR] Update README.md on asf-site branch with pointer to real readme [parquet-site]
alamb opened a new pull request, #57: URL: https://github.com/apache/parquet-site/pull/57

Note that this PR purposely targets the `asf-site` branch rather than the `development` or `production` branches.

## Rationale

The landing page of https://github.com/apache/parquet-site is confusing (it seems to have an outdated readme):

![Screenshot 2024-05-11 at 2 40 24 PM](https://github.com/apache/parquet-site/assets/490673/bf9c3187-e69a-4423-aaca-84a78b393e61)

It was not immediately clear to me that the `asf-site` branch hosts the output of statically building the website and that the updated README etc. are on the `production` branch.

## Changes

Update the README to direct people to the `production` branch, which has an updated readme.
Re: [PR] Add Dockerfile + instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
alamb commented on PR #56: URL: https://github.com/apache/parquet-site/pull/56#issuecomment-2105983298 Thanks for the review @wgtmac -- I have implemented your suggestion: I created a Dockerfile and updated the instructions to use it.
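As a rough sketch of what a Dockerfile-based preview could look like from the command line: the snippet below assumes a Dockerfile at the repository root, and the image tag `parquet-site-preview` is a hypothetical name, not something taken from the PR. Port 1313 is the one used by the `docker run` command discussed in this PR. The function only prints the commands; it does not require Docker to be installed.

```shell
#!/bin/sh
# Sketch only: IMAGE_TAG is a hypothetical name, and a Dockerfile is
# assumed to exist in the current directory (the parquet-site checkout).
IMAGE_TAG="parquet-site-preview"
PORT=1313   # port exposed by the docker run command quoted in the review

preview_commands() {
  # Build the image from the Dockerfile in the current directory.
  echo "docker build -t $IMAGE_TAG ."
  # Run it, publishing the preview port to the host.
  echo "docker run -it -p $PORT:$PORT $IMAGE_TAG"
}

# Print the two commands; pipe the output to `sh` to execute them.
preview_commands
```

With the container running, the preview would be reachable at http://localhost:1313, provided the server inside the container listens on all interfaces rather than only on localhost.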
Re: [PR] Add Dockerfile + instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
alamb commented on PR #56: URL: https://github.com/apache/parquet-site/pull/56#issuecomment-2105983147
> Thanks for the improvement! My only concern is that these steps may be out of sync easily (e.g. when the provided URLs are broken).

Perhaps we can update the instructions over time if/when they become broken? I am sure there are better ways to make such scripts, but in my opinion this is a step in the right direction.
Re: [PR] Add instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
alamb commented on code in PR #56: URL: https://github.com/apache/parquet-site/pull/56#discussion_r1597483388 ## README.md: ## @@ -14,21 +15,61 @@ cd parquet-site git submodule update --init --recursive ``` -To build or update your site’s CSS resources, you also need PostCSS to create the final assets. By default npm installs tools under the directory where you run npm install. +To build or update CSS resources, you also need PostCSS to create the final assets. By default npm installs tools under the directory where you run npm install. ``` npm install -D autoprefixer npm install -D postcss-cli npm install -D postcss ``` -To run this website site locally, run the following in the root of the directory: +To preview this website site locally, run the following in the root of the directory: ```shell hugo server ``` -# Release Documentation +## Building and Running in Docker + +If you don't want to install `hugo` and its dependencies local machine, you can use +docker to preview locally. First checkout the `parquet-site` explained above +and then run: + +```shell +# run docker container mounting the current directory to /parquet-site and exposing port 1313 +docker run -it -v `pwd`:/parquet-site -p 1313:1313 debian:bullseye-slim bash Review Comment: A docker file is a good idea. I will make one -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Add instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
wgtmac commented on code in PR #56: URL: https://github.com/apache/parquet-site/pull/56#discussion_r1597467736 ## README.md: ## @@ -14,21 +15,61 @@ cd parquet-site git submodule update --init --recursive ``` -To build or update your site’s CSS resources, you also need PostCSS to create the final assets. By default npm installs tools under the directory where you run npm install. +To build or update CSS resources, you also need PostCSS to create the final assets. By default npm installs tools under the directory where you run npm install. ``` npm install -D autoprefixer npm install -D postcss-cli npm install -D postcss ``` -To run this website site locally, run the following in the root of the directory: +To preview this website site locally, run the following in the root of the directory: ```shell hugo server ``` -# Release Documentation +## Building and Running in Docker + +If you don't want to install `hugo` and its dependencies local machine, you can use Review Comment: ```suggestion If you don't want to install `hugo` and its dependencies on local machine, you can use ``` ## README.md: ## @@ -14,21 +15,61 @@ cd parquet-site git submodule update --init --recursive ``` -To build or update your site’s CSS resources, you also need PostCSS to create the final assets. By default npm installs tools under the directory where you run npm install. +To build or update CSS resources, you also need PostCSS to create the final assets. By default npm installs tools under the directory where you run npm install. ``` npm install -D autoprefixer npm install -D postcss-cli npm install -D postcss ``` -To run this website site locally, run the following in the root of the directory: +To preview this website site locally, run the following in the root of the directory: ```shell hugo server ``` -# Release Documentation +## Building and Running in Docker + +If you don't want to install `hugo` and its dependencies local machine, you can use +docker to preview locally. 
First checkout the `parquet-site` explained above +and then run: + +```shell +# run docker container mounting the current directory to /parquet-site and exposing port 1313 +docker run -it -v `pwd`:/parquet-site -p 1313:1313 debian:bullseye-slim bash Review Comment: Is it better to use a dockerfile which is much easier to use? I'm just asking but not required to change. These steps are helpful enough. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] Add instructions on how to preview site using docker rather than installing `hugo` locally [parquet-site]
alamb opened a new pull request, #56: URL: https://github.com/apache/parquet-site/pull/56 In order to make changes to the website and have confidence that we won't break things, we should make sure we can see the results of the work locally. I can't / don't want to try and figure out how to get a local `hugo` install running, and prefer to use docker. I figured these instructions might help others. BTW I am happy to make a JIRA for this PR, but it isn't clear to me whether that is desired in Parquet.
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1597433857 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,40 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. + +This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects. + + +### parquet-format + +The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files. + +As a repository focused on specification, the parquet-format repository does not contain source code. + + +### parquet-mr + +The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files. + +The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below. + +Included in parquet-mr: +* Java/Scala Implementation: It contains the core Java/Scala implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop. 
+ +* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion. + + +### Other Clients / Libraries / Tools + +The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients, each offering different levels of feature support. It's important to note that not all implementations support the same features of the Parquet format. When integrating multiple Parquet implementations within your workflow, it is crucial to conduct thorough testing to ensure compatibility and performance across different platforms and tools. + +Here is a non-exhaustive list of Parquet implementations: + +* [parquet-mr](https://github.com/apache/parquet-mr) +* [Parquet C++, a subproject of Arrow C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) ([documentation](https://arrow.apache.org/docs/cpp/parquet.html)) +* [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) Review Comment: ```suggestion * [parquet go](https://github.com/apache/arrow/tree/main/go/parquet) * [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] PARQUET-2310: implementation status [parquet-site]
alamb commented on PR #34:
URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2105688283

FYI https://github.com/apache/parquet-site/pull/53 is a related conversation. Once that PR merges perhaps there will be a more natural location for this chart / location
Re: [PR] First draft of docs about parquet format vs mr [parquet-site]
alamb commented on code in PR #53:
URL: https://github.com/apache/parquet-site/pull/53#discussion_r1597433429

## content/en/docs/Overview/_index.md: ##
@@ -7,3 +7,40 @@ description: >
 ---
 Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
+
+This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects.
+
+
+### parquet-format
+
+The parquet-format project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files.
+
+As a repository focused on specification, the parquet-format repository does not contain source code.
+
+
+### parquet-mr
+
+The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files.
+
+The parquet-mr repo contains a "reference" implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below.

Review Comment:
I don't think there is any reason to add quotes

```suggestion
The parquet-mr repo contains a reference implementation of the Parquet format. There are a number of other Parquet format implementations, which are listed below.
```