It might be worth considering, at some point, reserving some extra bytes in the metadata that readers could use to read potential extensions.
Does that sound silly? Is this the right place to bring up ideas like this?

On Thu, Jun 8, 2017 at 11:34 AM, Lars Volker <[email protected]> wrote:

> Yes, I can't think of a way to add this information to the file without at least partially rewriting it. I don't know of a tool to update file metadata without doing a complete rewrite.
>
> On Thu, Jun 8, 2017 at 9:04 AM, Felipe Aramburu <[email protected]> wrote:
>
> > The answer to this is probably no. But I imagine that it is not considered acceptable to try and modify this statistics information AFTER the parquet file has been generated, correct?
> >
> > On Thu, Jun 8, 2017 at 9:59 AM, Lars Volker <[email protected]> wrote:
> >
> > > I suppose you would look at the Statistics struct in the parquet.thrift <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift> file in the parquet-format project. Before spending much time on this, you may want to seek more feedback, possibly on this list, and by opening a JIRA. Since it likely is a rather small change, you might also go ahead and create a pull request and ask for feedback there. Please note that the PR will need a corresponding JIRA in its title.
> > >
> > > You can find more detailed information on the individual steps here: https://parquet.apache.org/contribute/
> > >
> > > Cheers, Lars
> > >
> > > On Wed, Jun 7, 2017 at 11:40 AM, Felipe Aramburu <[email protected]> wrote:
> > >
> > > > So I guess it's just calculating the distance between the offsets. For now we might just make that part of our "catalogue" step. If I wanted to add it to statistics, is there somewhere you can point me to where that would be added?
> > > >
> > > > Felipe
> > > >
> > > > On Wed, Jun 7, 2017 at 1:33 PM, Michael Howard <[email protected]> wrote:
> > > >
> > > > > > Could this be a candidate to add to the Statistics?
> > > > >
> > > > > Agreed ... this would be good info to have.
> > > > >
> > > > > On Wed, Jun 7, 2017 at 2:25 PM, Lars Volker <[email protected]> wrote:
> > > > >
> > > > > > Could this be a candidate to add to the Statistics?
> > > > > >
> > > > > > On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti <[email protected]> wrote:
> > > > > >
> > > > > > > The parquet metadata does not have such information.
> > > > > > >
> > > > > > > On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu <[email protected]> wrote:
> > > > > > >
> > > > > > > > Is there any metadata available on the maximum length of a BYTE_ARRAY element in a row group?
> > > > > > > >
> > > > > > > > So for example, if I have a column which is of type BYTE_ARRAY, logical type UTF8, and I want to know what the longest possible element in the row group is.
> > > > > > > >
> > > > > > > > I am looking for a method to do this which does NOT require having to go through the data itself. So I am asking if this metadata is stored anywhere.
> > > > > > > >
> > > > > > > > Felipe
> > > > > > >
> > > > > > > --
> > > > > > > regards,
> > > > > > > Deepak Majeti
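
For reference, the "distance between the offsets" idea from the thread could be sketched roughly as below. This is only an illustration of a catalogue-step calculation, not anything Parquet provides. It assumes the BYTE_ARRAY values have already been decoded into a list of n + 1 byte offsets for n elements (an Arrow-style layout), and the function name max_element_length and the sample offsets are hypothetical.

```python
from typing import Sequence


def max_element_length(offsets: Sequence[int]) -> int:
    """Return the longest element length implied by a run of offsets.

    Assumes `offsets` holds n + 1 non-decreasing byte positions for n
    elements, so element i occupies bytes offsets[i]..offsets[i + 1].
    """
    if len(offsets) < 2:
        return 0  # no elements, so no length to report
    longest = 0
    # Pair each offset with its successor; the difference is one length.
    for start, end in zip(offsets, offsets[1:]):
        if end < start:
            raise ValueError("offsets must be non-decreasing")
        longest = max(longest, end - start)
    return longest


# Hypothetical offsets for the values "a", "hello", "", "parquet"
example_offsets = [0, 1, 6, 6, 13]
print(max_element_length(example_offsets))  # prints 7
```

Note that this still walks the decoded data once, so it does not answer the original question of getting the value from metadata alone; it only shows what the catalogue step would store.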
