>
> But I'm also attaching table
> metadata to each feather, which I'd hate to lose.
>

Note the arrow format allows attaching custom metadata at the column
(field), schema, and message level, so it should be possible to retain any
metadata this way.

-Jacob

On Wed, Oct 7, 2020 at 11:38 AM Benjamin MacDonald Schmidt <
[email protected]> wrote:

> Hello,
>
> Exciting project, thanks for all your work. I gather it's appropriate to
> ask a usage question here? Assuming so:
>
> I have a web application that serves portions of a dataset I've broken into
> a few thousand featherV2 files structured as a quadtree. The structure
> makes heavy use of text dictionary types; I'd like to have each dictionary
> integer map to the same string across all files so that I can ship the data
> for each tile straight to GPU without decoding the text.
>
> If you slice a portion of a pandas categorical array and coerce to an arrow
> dictionary, you keep the underlying pandas integer encoding; for example,
> the last line here shows a dictionary with four keys even though the table
> has just one row.
>
> ```
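> import pandas as pd
> import pyarrow as pa
>
> # "A", "B", "C", "F" become the four categories, encoded as integers 0-3.
> pandas_cat = pd.Series(["A", "B", "C", "B", "F"], dtype="category")
>
> # Slicing to one row keeps all four categories in the Arrow dictionary.
> pa.Array.from_pandas(pandas_cat[2:3])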
> ```
>
> For my purposes, this is good! But of course it's wasteful, too. So I'm
> wondering:
>
> 1. Whether it's safe to count on the above code continuing to use the
> internal pandas keys in the arrow output, or whether at some point it might
> redo the pandas encoding in a more efficient way;
> 2. Whether there's a native pyarrow way to ensure that multiple feather
> dictionaries across files use the same integer identifiers for all the keys
> that they share.
>
> I can see that the right way here might be to use the IPC streaming format
> rather than feather, and send out a single schema for the dataset, with
> dictionary batches identifying the keys. But I'm also attaching table
> metadata to each feather, which I'd hate to lose.
>
> --
> Benjamin Schmidt
> Director of Digital Humanities and Clinical Associate Professor of History
> 20 Cooper Square, Room 538
> New York University
>
> benschmidt.org
>
