Yes, I can't think of a way to add this information to the file without at least partially rewriting it. I don't know of a tool to update file metadata without doing a complete rewrite.
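For the catalogue-side approach mentioned further down the thread (taking the distance between consecutive offsets), here is a minimal sketch in Python. It assumes your reader has already produced an offsets list with one more entry than there are values, where value i spans offsets[i] to offsets[i + 1]. That layout and the name max_element_length are my assumptions for illustration; none of this comes from the Parquet metadata itself.

```python
from typing import Sequence


def max_element_length(offsets: Sequence[int]) -> int:
    """Return the largest gap between consecutive offsets.

    Assumption (hypothetical layout): `offsets` holds n + 1 non-decreasing
    byte positions for n values, so value i occupies
    offsets[i] .. offsets[i + 1].
    """
    if len(offsets) < 2:
        # No complete value is described, so there is no length to report.
        return 0
    return max(end - start for start, end in zip(offsets, offsets[1:]))


# Four values with lengths 5, 0, 7 and 2 bytes.
example_offsets = [0, 5, 5, 12, 14]
print(max_element_length(example_offsets))  # 7
```

This still needs the offsets, so it does not avoid reading the data once; it only lets you store the result in your own catalogue afterwards.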

On Thu, Jun 8, 2017 at 9:04 AM, Felipe Aramburu <[email protected]> wrote:

> The answer to this is probably no. But I imagine that it is not considered
> acceptable to try and modify this statistics information AFTER the parquet
> file has been generated, correct?
>
> On Thu, Jun 8, 2017 at 9:59 AM, Lars Volker <[email protected]> wrote:
>
> > I suppose you would look at the Statistics struct in the parquet.thrift
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > file in the parquet-format project. Before spending much time on this,
> > you may want to seek more feedback, possibly on this list, and by
> > opening a JIRA. Since it likely is a rather small change, you might also
> > go ahead and create a pull request and ask for feedback there. Please
> > note that the PR will need a corresponding JIRA in its title.
> >
> > You can find more detailed information on the individual steps here:
> > https://parquet.apache.org/contribute/
> >
> > Cheers, Lars
> >
> > On Wed, Jun 7, 2017 at 11:40 AM, Felipe Aramburu <[email protected]> wrote:
> >
> > > So I guess it's just calculating the distance between the offsets. For
> > > now we might just make that part of our "catalogue" step. If I wanted
> > > to add it to statistics, is there somewhere you can point me to where
> > > that would be added?
> > >
> > > Felipe
> > >
> > > On Wed, Jun 7, 2017 at 1:33 PM, Michael Howard <[email protected]> wrote:
> > >
> > > > > Could this be a candidate to add to the Statistics?
> > > >
> > > > Agreed ... this would be good info to have.
> > > >
> > > > On Wed, Jun 7, 2017 at 2:25 PM, Lars Volker <[email protected]> wrote:
> > > >
> > > > > Could this be a candidate to add to the Statistics?
> > > > >
> > > > > On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti <[email protected]> wrote:
> > > > >
> > > > > > The parquet metadata does not have such information.
> > > > > >
> > > > > > On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu <[email protected]> wrote:
> > > > > >
> > > > > > > Is there any metadata available on the maximum length of a
> > > > > > > BYTE_ARRAY element in a row group?
> > > > > > >
> > > > > > > So for example, if I have a column which is of type BYTE_ARRAY
> > > > > > > with logical type UTF8, I want to know what the longest
> > > > > > > possible element in the row group is.
> > > > > > >
> > > > > > > I am looking for a method to do this which does NOT require
> > > > > > > having to go through the data itself. So I am asking if this
> > > > > > > metadata is stored anywhere.
> > > > > > >
> > > > > > > Felipe
> > > > > >
> > > > > > --
> > > > > > regards,
> > > > > > Deepak Majeti
