If you store it as a separate file, as opposed to keeping it inside the parquet file, then all of a sudden you have to manage keeping those two files available together. Your individual parquet files no longer contain all the information you may want or need.

Is it not possible to modify the contents of a parquet file so long as you are not changing the size of the data that was written? For example, if I have a parquet file and I want to modify bytes in that file without changing the size of the file, I am pretty sure this is possible, right?
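To make the question concrete, here is a minimal sketch of a same-size, in-place overwrite at the filesystem level, using only the Python standard library. The helper name `overwrite_in_place` and the stand-in file contents are hypothetical; this is not a Parquet API and the demo file is not a real Parquet file.

```python
import os
import tempfile


def overwrite_in_place(path: str, offset: int, new_bytes: bytes) -> None:
    """Overwrite len(new_bytes) bytes at `offset` without changing the file size."""
    size = os.path.getsize(path)
    if offset < 0 or offset + len(new_bytes) > size:
        raise ValueError("write would fall outside the existing file")
    # "r+b" opens for reading and writing without truncating the file.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(new_bytes)


# Demo on a stand-in file (hypothetical contents, not a real Parquet file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"PAR1....payload....PAR1")
    demo_path = tmp.name

size_before = os.path.getsize(demo_path)
overwrite_in_place(demo_path, 8, b"PAYLOAD")
size_after = os.path.getsize(demo_path)

with open(demo_path, "rb") as f:
    contents = f.read()
os.remove(demo_path)
```

So the bytes themselves can be rewritten without a size change. Whether the result is still a valid parquet file is a separate question: as far as I know the footer is serialized with the Thrift compact protocol, which uses variable-length integers, so a changed value is not guaranteed to encode to the same number of bytes as the old one.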
On Thu, Jun 8, 2017 at 11:09 PM, Deepak Majeti <[email protected]> wrote:

> User metadata can be specified in parquet via key-value metadata. But, once a parquet file has been written, any modification will require a re-write. Basically, de-serialize, modify and serialize. Bytes that are part of the parquet-format (spec) will require the above process.
> If your proposal is to keep a scratchpad buffer say (64KB) at the end of a file that is not part of the parquet-format, I don't see a lot of benefit to it. Why not store the custom extensions as a separate file?
>
> And, this is indeed the right place to bring up more ideas.
>
> On Thu, Jun 8, 2017 at 1:47 PM, Felipe Aramburu <[email protected]> wrote:
>
> > It might be interesting at some point to consider specifying some extra bytes available in the metadata that can be used to read potential extensions.
> >
> > Does that sound silly? Is this the right place to bring up ideas like this?
> >
> > On Thu, Jun 8, 2017 at 11:34 AM, Lars Volker <[email protected]> wrote:
> >
> > > Yes, I can't think of a way to add this information to the file without at least partially rewriting it. I don't know of a tool to update file metadata without doing a complete rewrite.
> > >
> > > On Thu, Jun 8, 2017 at 9:04 AM, Felipe Aramburu <[email protected]> wrote:
> > >
> > > > The answer to this is probably no. But I imagine that it is not considered acceptable to try and modify this statistics information AFTER the parquet file has been generated correct?
> > > >
> > > > On Thu, Jun 8, 2017 at 9:59 AM, Lars Volker <[email protected]> wrote:
> > > >
> > > > > I suppose you would look at the Statistics struct in the parquet.thrift <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift> file in the parquet-format project. Before spending much time on this, you may want to seek more feedback, possibly on this list, and by opening a JIRA. Since it likely is a rather small change, you might also go ahead and create a pull request and ask for feedback there. Please note, that the PR will need a corresponding JIRA in its title.
> > > > >
> > > > > You can find more detailed information on the individual steps here: https://parquet.apache.org/contribute/
> > > > >
> > > > > Cheers, Lars
> > > > >
> > > > > On Wed, Jun 7, 2017 at 11:40 AM, Felipe Aramburu <[email protected]> wrote:
> > > > >
> > > > > > So I guess its just calculating the distance between the offsets. For now we might just make that part of our "catalogue" step. If I wanted to add it to statistics is there somewhere you can point me to where that would be added?
> > > > > >
> > > > > > Felipe
> > > > > >
> > > > > > On Wed, Jun 7, 2017 at 1:33 PM, Michael Howard <[email protected]> wrote:
> > > > > >
> > > > > > > > Could this be a candidate to add to the Statistics?
> > > > > > >
> > > > > > > Agreed ... this would be good info to have.
> > > > > > >
> > > > > > > On Wed, Jun 7, 2017 at 2:25 PM, Lars Volker <[email protected]> wrote:
> > > > > > >
> > > > > > > > Could this be a candidate to add to the Statistics?
> > > > > > > >
> > > > > > > > On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > The parquet metadata does not have such information.
> > > > > > > > >
> > > > > > > > > On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Is there any metadata available on the maximum length of an element BYTE_ARRAY in a row group.
> > > > > > > > > >
> > > > > > > > > > So for example if I have a column which is of type BYTE_ARRAY Logical type UTF8 and I want to know what the longest possible element in the row group is.
> > > > > > > > > >
> > > > > > > > > > I am looking for a method to do this which does NOT require having to go through the data itself. So I am asking if this metadata is stored anywhere.
> > > > > > > > > >
> > > > > > > > > > Felipe
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > regards,
> > > > > > > > > Deepak Majeti
>
> --
> regards,
> Deepak Majeti
