[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276735#comment-17276735 ]
ASF GitHub Bot commented on PARQUET-1950:
-----------------------------------------
raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827
@gszadovszky and @emkornfield it's quite a coincidence that I was just
looking into cleaning up apache/arrow#8130 when I noticed this thread.
External column chunk support is one of the key features that attracted me
to Parquet in the first place, and I would like the chance to lobby for
keeping it and expanding its adoption - I already have the complete PR
mentioned above, and I can help with supporting the feature across other
implementations.
There are a few major domains where I see this as a valuable component (a
sketch of the underlying footer mechanism follows the list):
1. Allowing concurrent reads of fully flushed row groups while the Parquet
file is still being appended to. A slight variant of this is allowing
subsequent row group appends to a Parquet file without impacting potential
readers.
2. Being able to aggregate multiple data sets into a master Parquet file.
One scenario is cumulative recordings, such as stock prices that get
collected daily and need to be presented as one unified historical file;
another is enrichment, where we want to add new columns to an existing
data set.
3. Allowing for bi-temporal changes to a Parquet file. External column
chunks allow one to apply small corrections by creating delta files and
new footers that swap out only the chunks that require changes and point
to the new ones.
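All three scenarios lean on a mechanism parquet-format already defines: the
file_path field on ColumnChunk, which lets a footer reference column data
stored outside the footer's own file. A minimal sketch of the relevant
portion of parquet.thrift (abridged, with comments paraphrased):

    struct ColumnChunk {
      /** File where the column data is stored. If not set, the data is
        * assumed to live in the same file as the metadata. This path is
        * relative to the current file. */
      1: optional string file_path

      /** Byte offset in file_path to the ColumnMetaData */
      2: required i64 file_offset

      // remaining fields (meta_data, index offsets, ...) elided
    }

Under this scheme, the bi-temporal correction in (3) amounts to writing a
new footer whose ColumnChunk entries keep pointing at the original file for
untouched chunks and set file_path to the delta file for the corrected
ones; the original data is never rewritten.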
If the above use cases are addressed by other Parquet overlays, or they
don't line up with the intended usage of Parquet, I can look elsewhere,
but this seems like a huge opportunity, and the development cost of
supporting it is quite minor by comparison.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Define core features / compliance level
> ---------------------------------------
>
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
>
> Parquet format is getting more and more features while the different
> implementations cannot keep pace and are left behind, with some features
> implemented and others not. In many cases it is also unclear whether a
> given feature is mature enough to be used widely or is more of an
> experimental one.
> These are huge issues that make it hard to ensure interoperability between
> the different implementations.
> The following idea came up in a
> [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
> Create a new document in the parquet-format repository that lists the "core
> features". This document is versioned by the parquet-format releases; this
> way a certain version of the "core features" defines a level of
> compatibility between the different implementations. The version number can
> be written to a new field (e.g. complianceLevel) in the footer. If an
> implementation writes a file with a version in this field, it must implement
> all the related "core features" (read and write) and must not use any other
> features at write time, because doing so would make the data unreadable by
> another implementation that implements only the same level of "core
> features".
> For example, if encoding A is listed in the version 1 "core features" but
> encoding B is not, then at "complianceLevel = 1" we can use encoding A but
> not encoding B, because B would make the related data unreadable.
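For concreteness, a minimal sketch of how the proposed field could look in
parquet.thrift. The name complianceLevel is taken from the description
above; the field id and the i32 type are illustrative assumptions, not a
committed design:

    struct FileMetaData {
      // ... existing fields (version, schema, num_rows, row_groups, ...) ...

      /** Hypothetical: the version of the "core features" document this
        * file complies with. A writer that sets this field must implement
        * all core features of that version (read and write) and must not
        * write any feature outside that set. */
      99: optional i32 complianceLevel  // field id is a placeholder
    }

Keeping the field optional leaves existing files valid, and a reader that
implements the core features up to level N can safely consume any file
whose complianceLevel is at most N.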
--
This message was sent by Atlassian Jira
(v8.3.4#803005)