petenewcomb opened a new issue, #40630:
URL: https://github.com/apache/arrow/issues/40630

   ### Describe the enhancement requested
   
   The Parquet file format allows a file to continue accumulating row groups 
after a footer has been written, as long as a new, cumulative footer is 
written afterward.  This is useful when writing a stream of data directly 
to Parquet and needing to ensure that the data is fully durable and readable 
within some time bound.  For this purpose I propose a new method 
`FlushWithFooter` on `file.Writer` that, like its sibling `Close`, would close 
any open row group and prepare and write out the file footer.  Unlike `Close`, 
it would leave the writer's metadata structures intact, allowing subsequent row 
groups to be written without starting over, thus ensuring that the metadata 
written into subsequent footers via `FlushWithFooter` or `Close` is inclusive 
of all row groups written since the beginning of the file.
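   To illustrate the layout trick this proposal relies on, here is a minimal byte-level sketch (not the real parquet-go API; `appendFooter`, `readLastFooter`, and `buildFile` are hypothetical names, and the row-group and metadata bytes are placeholders). It relies on the fact that a Parquet file ends with its footer metadata, a 4-byte little-endian footer length, and the `PAR1` magic, so a reader seeking to end-of-file sees only the last footer written; an earlier footer left behind mid-file becomes inert bytes.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// appendFooter appends a simulated footer: metadata bytes, then the
// 4-byte little-endian footer length, then the trailing "PAR1" magic,
// as the Parquet format requires at end-of-file.
func appendFooter(buf *bytes.Buffer, metadata []byte) {
	buf.Write(metadata)
	var lenBytes [4]byte
	binary.LittleEndian.PutUint32(lenBytes[:], uint32(len(metadata)))
	buf.Write(lenBytes[:])
	buf.WriteString("PAR1")
}

// readLastFooter does what a Parquet reader does: seek to the end,
// check the magic, read the footer length, and return the metadata.
func readLastFooter(file []byte) []byte {
	n := len(file)
	if n < 8 || string(file[n-4:]) != "PAR1" {
		panic("not a parquet-style file")
	}
	footerLen := int(binary.LittleEndian.Uint32(file[n-8 : n-4]))
	return file[n-8-footerLen : n-8]
}

// buildFile simulates a writer that flushes a footer mid-stream for
// durability and then keeps appending row groups, finishing with a
// cumulative footer that covers every row group written so far.
func buildFile() []byte {
	var buf bytes.Buffer
	buf.WriteString("PAR1")       // leading magic
	buf.WriteString("rowgroup-1") // simulated row group data
	appendFooter(&buf, []byte("meta:[rg1]")) // first durable footer

	// The old footer is now dead bytes inside the file; only the
	// final footer matters to readers.
	buf.WriteString("rowgroup-2")
	appendFooter(&buf, []byte("meta:[rg1,rg2]")) // cumulative footer
	return buf.Bytes()
}

func main() {
	fmt.Printf("last footer: %s\n", readLastFooter(buildFile()))
}
```

Running this prints `last footer: meta:[rg1,rg2]`: the reader sees only the cumulative footer, which is exactly the property `FlushWithFooter` would give a live stream.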
   
   The alternative, and what is supported today, is to close the open file once 
the time bound has been reached and start a new one.  This works for 
durability, but is inefficient for readers, since they must now open and 
process the footers of a potentially much larger number of files.  The typical 
workflow is to have a second process "compact" these smaller files to produce 
larger files that not only consolidate footers but apply other optimizations 
(such as z-ordering) that holistically reorganize the consolidated data to 
match observed or expected query patterns.  While effective for readers of 
older data, such compactions take time and significant resources to execute, 
putting a practical lower bound on the freshness of their outputs.
   
   This feature, if adopted, would allow writers to produce data into a modest 
and predictable number of files within a strict time bound for durability such 
that readers enjoy that same time bound and modest number of files to 
efficiently query fresh data without intervening compaction.  Compaction would 
still be recommended, both to apply holistic optimizations and to collapse the 
extra footers inserted into the original files, but it would be less urgent 
since compaction would no longer be a constraint on freshness or the 
manageability of file cardinality.
   
   ### Component(s)
   
   Go, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
