User metadata can be stored in a Parquet file as key-value metadata. However,
once a Parquet file has been written, any modification requires a rewrite:
deserialize, modify, and serialize again. That applies to every byte that is
part of the parquet-format spec.
If your proposal is to keep a scratchpad buffer (say, 64 KB) at the end of the
file that is not part of parquet-format, I don't see much benefit in it. Why
not store the custom extensions in a separate file?
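
To make the separate-file idea concrete, here is a minimal sketch in Python.
It assumes the element boundaries are already available as a list of
cumulative byte offsets (the "distance between the offsets" from the catalogue
step discussed below). The ".ext.json" suffix, the JSON layout, and all
function names are hypothetical choices for illustration, not part of
parquet-format or any Parquet library.

```python
import json
from pathlib import Path


def max_element_length(offsets):
    """Return the longest element implied by cumulative byte offsets.

    `offsets` has one more entry than there are elements: element i
    occupies the bytes from offsets[i] up to offsets[i + 1].
    """
    lengths = (end - start for start, end in zip(offsets, offsets[1:]))
    return max(lengths, default=0)


def sidecar_path(parquet_path):
    # Hypothetical naming convention: data.parquet -> data.parquet.ext.json
    return Path(str(parquet_path) + ".ext.json")


def write_sidecar(parquet_path, max_lengths):
    """Write per-row-group, per-column maximum lengths next to the file.

    The Parquet file itself is never opened or modified.
    """
    path = sidecar_path(parquet_path)
    path.write_text(json.dumps({"max_byte_array_length": max_lengths}, indent=2))
    return path


def read_sidecar(parquet_path):
    """Read back the mapping stored by write_sidecar."""
    data = json.loads(sidecar_path(parquet_path).read_text())
    return data["max_byte_array_length"]
```

For example, write_sidecar("data.parquet", {"0": {"name":
max_element_length(offsets)}}) would record the value for column "name" in row
group 0, and it can be regenerated at any time without touching the Parquet
file.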

And yes, this is indeed the right place to bring up ideas like this.


On Thu, Jun 8, 2017 at 1:47 PM, Felipe Aramburu <[email protected]>
wrote:

> It might be interesting at some point to consider specifying some extra
> bytes available in the metadata that can be used to read potential
> extensions.
>
> Does that sound silly? Is this the right place to bring up ideas like this?
>
> On Thu, Jun 8, 2017 at 11:34 AM, Lars Volker <[email protected]> wrote:
>
> > Yes, I can't think of a way to add this information to the file without
> > at least partially rewriting it. I don't know of a tool to update file
> > metadata without doing a complete rewrite.
> >
> > On Thu, Jun 8, 2017 at 9:04 AM, Felipe Aramburu <[email protected]>
> > wrote:
> >
> > > The answer to this is probably no, but I imagine it is not considered
> > > acceptable to try to modify this statistics information AFTER the
> > > Parquet file has been generated, correct?
> > >
> > > On Thu, Jun 8, 2017 at 9:59 AM, Lars Volker <[email protected]> wrote:
> > >
> > > > I suppose you would look at the Statistics struct in the parquet.thrift
> > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > > > file in the parquet-format project. Before spending much time on this,
> > > > you may want to seek more feedback, possibly on this list, and by
> > > > opening a JIRA. Since it is likely a rather small change, you might
> > > > also go ahead and create a pull request and ask for feedback there.
> > > > Please note that the PR will need a corresponding JIRA in its title.
> > > >
> > > > You can find more detailed information on the individual steps here:
> > > > https://parquet.apache.org/contribute/
> > > >
> > > > Cheers, Lars
> > > >
> > > > On Wed, Jun 7, 2017 at 11:40 AM, Felipe Aramburu <[email protected]>
> > > > wrote:
> > > >
> > > > > So I guess it's just calculating the distance between the offsets.
> > > > > For now we might just make that part of our "catalogue" step. If I
> > > > > wanted to add it to statistics, is there somewhere you can point me
> > > > > to where that would be added?
> > > > >
> > > > > Felipe
> > > > >
> > > > > On Wed, Jun 7, 2017 at 1:33 PM, Michael Howard <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > > Could this be a candidate to add to the Statistics?
> > > > > >
> > > > > > Agreed ... this would be good info to have.
> > > > > >
> > > > > > On Wed, Jun 7, 2017 at 2:25 PM, Lars Volker <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Could this be a candidate to add to the Statistics?
> > > > > > >
> > > > > > > On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > The parquet metadata does not have such information.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Is there any metadata available on the maximum length of a
> > > > > > > > > BYTE_ARRAY element in a row group?
> > > > > > > > >
> > > > > > > > > So, for example, if I have a column of type BYTE_ARRAY with
> > > > > > > > > logical type UTF8, I want to know the length of the longest
> > > > > > > > > element in the row group.
> > > > > > > > >
> > > > > > > > > I am looking for a method that does NOT require going
> > > > > > > > > through the data itself, so I am asking whether this
> > > > > > > > > metadata is stored anywhere.
> > > > > > > > >
> > > > > > > > > Felipe
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > regards,
> > > > > > > > Deepak Majeti
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 
regards,
Deepak Majeti
