I am not sure if there is an API to directly modify serialized Thrift data.

On Fri, Jun 9, 2017 at 5:45 AM, Felipe Aramburu <[email protected]> wrote:
> If you were to store it as a separate file, as opposed to putting it inside
> the file, then all of a sudden you have to manage keeping those two files
> available. Your individual parquet files no longer contain all the
> information you may want or need. Is it not possible to modify the contents
> of a parquet file so long as you are not changing the size of the data that
> was written? For example, if I have a parquet file and I want to modify the
> bytes in that file without changing the size of the file, I am pretty sure
> this is possible, right?
>
> On Thu, Jun 8, 2017 at 11:09 PM, Deepak Majeti <[email protected]> wrote:
>
> > User metadata can be specified in parquet via key-value metadata. But once
> > a parquet file has been written, any modification will require a re-write:
> > basically, de-serialize, modify, and serialize. Bytes that are part of the
> > parquet-format (spec) will require the above process.
> > If your proposal is to keep a scratchpad buffer (say 64KB) at the end of a
> > file that is not part of the parquet-format, I don't see a lot of benefit
> > to it. Why not store the custom extensions as a separate file?
> >
> > And this is indeed the right place to bring up more ideas.
> >
> > On Thu, Jun 8, 2017 at 1:47 PM, Felipe Aramburu <[email protected]> wrote:
> >
> > > It might be interesting at some point to consider specifying some extra
> > > bytes available in the metadata that can be used to read potential
> > > extensions.
> > >
> > > Does that sound silly? Is this the right place to bring up ideas like this?
> > >
> > > On Thu, Jun 8, 2017 at 11:34 AM, Lars Volker <[email protected]> wrote:
> > >
> > > > Yes, I can't think of a way to add this information to the file without at
> > > > least partially rewriting it. I don't know of a tool to update file
> > > > metadata without doing a complete rewrite.
> > > >
> > > > On Thu, Jun 8, 2017 at 9:04 AM, Felipe Aramburu <[email protected]> wrote:
> > > >
> > > > > The answer to this is probably no. But I imagine that it is not considered
> > > > > acceptable to try to modify this statistics information AFTER the parquet
> > > > > file has been generated, correct?
> > > > >
> > > > > On Thu, Jun 8, 2017 at 9:59 AM, Lars Volker <[email protected]> wrote:
> > > > >
> > > > > > I suppose you would look at the Statistics struct in the parquet.thrift
> > > > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > > > > > file in the parquet-format project. Before spending much time on this, you
> > > > > > may want to seek more feedback, possibly on this list, and by opening a
> > > > > > JIRA. Since it is likely a rather small change, you might also go ahead and
> > > > > > create a pull request and ask for feedback there. Please note that the PR
> > > > > > will need a corresponding JIRA in its title.
> > > > > >
> > > > > > You can find more detailed information on the individual steps here:
> > > > > > https://parquet.apache.org/contribute/
> > > > > >
> > > > > > Cheers, Lars
> > > > > >
> > > > > > On Wed, Jun 7, 2017 at 11:40 AM, Felipe Aramburu <[email protected]> wrote:
> > > > > >
> > > > > > > So I guess it's just calculating the distance between the offsets. For now
> > > > > > > we might just make that part of our "catalogue" step. If I wanted to add it
> > > > > > > to statistics, is there somewhere you can point me to where that would be
> > > > > > > added?
> > > > > > >
> > > > > > > Felipe
> > > > > > >
> > > > > > > On Wed, Jun 7, 2017 at 1:33 PM, Michael Howard <[email protected]> wrote:
> > > > > > >
> > > > > > > > > Could this be a candidate to add to the Statistics?
> > > > > > > >
> > > > > > > > Agreed ... this would be good info to have.
> > > > > > > >
> > > > > > > > On Wed, Jun 7, 2017 at 2:25 PM, Lars Volker <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Could this be a candidate to add to the Statistics?
> > > > > > > > >
> > > > > > > > > On Wed, Jun 7, 2017 at 11:18 AM, Deepak Majeti <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > The parquet metadata does not have such information.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 7, 2017 at 1:08 PM, Felipe Aramburu <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Is there any metadata available on the maximum length of a
> > > > > > > > > > > BYTE_ARRAY element in a row group?
> > > > > > > > > > >
> > > > > > > > > > > For example, if I have a column of type BYTE_ARRAY with logical
> > > > > > > > > > > type UTF8, I want to know what the longest element in the row
> > > > > > > > > > > group is.
> > > > > > > > > > >
> > > > > > > > > > > I am looking for a method to do this which does NOT require having to go
> > > > > > > > > > > through the data itself. So I am asking if this metadata is stored
> > > > > > > > > > > anywhere.
> > > > > > > > > > >
> > > > > > > > > > > Felipe
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > regards,
> > > > > > > > > > Deepak Majeti
> >
> > --
> > regards,
> > Deepak Majeti

--
regards,
Deepak Majeti
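
The "distance between the offsets" calculation Felipe mentions for the catalogue step can be sketched roughly as follows. This is only an illustration in Python, assuming the BYTE_ARRAY values have already been decoded into an offsets buffer of n + 1 non-decreasing entries (element i occupies bytes offsets[i] to offsets[i + 1]). The function name max_element_length and the sample values are hypothetical and not part of any Parquet API.

```python
from typing import Sequence


def max_element_length(offsets: Sequence[int]) -> int:
    """Return the longest element length implied by an offsets buffer.

    Assumption: `offsets` has n + 1 non-decreasing entries for n elements,
    so element i spans bytes offsets[i] up to (not including) offsets[i + 1].
    """
    if len(offsets) < 2:
        return 0  # no elements, so there is no length to report
    # The length of each element is the distance between adjacent offsets.
    return max(end - start for start, end in zip(offsets, offsets[1:]))


# Hypothetical row group holding four UTF8 values, one of them empty.
values = [b"a", b"hello", b"", b"xyz"]

# Build the offsets buffer: a running total of the value lengths.
offsets = [0]
for value in values:
    offsets.append(offsets[-1] + len(value))

print(max_element_length(offsets))
```

Note that this still requires a pass over the offsets, so it fits a one-time catalogue step rather than the metadata-only lookup asked about in the original question.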
