Re: Dictionary key access in python/generally

Micah Kornfield Wed, 07 Oct 2020 21:29:15 -0700

I can't speak to whether the Pandas conversion will ever change.  Someone else
can potentially chime in.  I don't recall any JIRAs recently changing this
type of conversion; however, there currently aren't any hard guarantees of
backwards compatibility for library functionality (generally we try to do
our best not to break things).


> I can see that the right way here might be to use the IPC streaming format
> rather than feather, and send out a single schema for the dataset, with
> dictionary batches identifying the keys.


Feather V2 should be the same as the Arrow file format, which is different
from the stream format.  There is a direct writer [1] for this as well, so
if you have the ability to construct your Arrow tables directly from the
same dictionary, this would be the best way of ensuring that any changes to
the Pandas conversion do not impact you.

[1]
https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files

On Wed, Oct 7, 2020 at 10:44 AM Jacob Quinn <[email protected]> wrote:

> >
> > But I'm also attaching table
> > metadata to each feather, which I'd hate to lose.
> >
>
> Note the arrow format allows attaching custom metadata at the column
> (field), schema, and message level, so it should be possible to retain any
> metadata this way.
>
> -Jacob
>
> On Wed, Oct 7, 2020 at 11:38 AM Benjamin MacDonald Schmidt <
> [email protected]> wrote:
>
> > Hello,
> >
> > Exciting project, thanks for all your work. I gather it's appropriate to
> > ask a use question here? Assuming so:
> >
> > I have a web application that serves portions of a dataset I've broken into
> > a few thousand featherV2 files structured as a quadtree. The structure
> > makes heavy use of text dictionary types; I'd like to have each dictionary
> > integer map to the same string across all files so that I can ship the data
> > for each tile straight to GPU without decoding the text.
> >
> > If you slice a portion of a pandas categorical array and coerce to an arrow
> > dictionary, you keep the underlying pandas integer encoding; for example,
> > the last line here shows a dictionary with four keys even though the table
> > has just one row.
> >
> > ```
> > import pandas as pd
> > import pyarrow as pa
> > pandas_cat = pd.Series(["A", "B", "C", "B", "F"], dtype = "category")
> > pa.Array.from_pandas(pandas_cat[2:3])
> > ```
> >
> > For my purposes, this is good! But of course it's wasteful, too. So I'm
> > wondering:
> >
> > 1. Whether it's safe to count on the above code continuing to use the
> > internal pandas keys in the arrow output, or whether at some point it might
> > redo the pandas encoding in a more efficient way;
> > 2. Whether there's a native pyarrow way to ensure that multiple feather
> > dictionaries across files use the same integer identifiers for all the keys
> > that they share.
> >
> > I can see that the right way here might be to use the IPC streaming format
> > rather than feather, and send out a single schema for the dataset, with
> > dictionary batches identifying the keys. But I'm also attaching table
> > metadata to each feather, which I'd hate to lose.
> >
> > --
> > Benjamin Schmidt
> > Director of Digital Humanities and Clinical Associate Professor of History
> > 20 Cooper Square, Room 538
> > New York University
> >
> > benschmidt.org
> >
>
