One more general question is whether the file format is really
beneficial over the stream format in practice. I understand the
theoretical argument for direct access to specific batches, but are
there situations where it really matters? Intuitively, it seems to me
that if your data is really large, you may be better off with a more
space-optimized format such as Parquet.
Le 19/03/2021 à 19:49, Wes McKinney a écrit :
Okay, let’s open an issue then to address that at some point. What I recall
from our last discussion was that the dictionaries would be “processed”
when beginning to read the file, appending all the deltas to yield one set
of dictionaries for reassembly. The downside is that the “partial
dictionaries” that existed at the time that the file was written are not
recoverable, but that seems like an acceptable compromise.
On Fri, Mar 19, 2021 at 10:34 AM Antoine Pitrou <anto...@python.org> wrote:
Le 19/03/2021 à 13:37, Wes McKinney a écrit :
I am also under the impression that the file format is supposed to
support
deltas, but not replacements. Is this not implemented in C++?
Definitely not. Also I was not aware that the file format was supposed
to support deltas.
Regards
Antoine.