> > But I'm also attaching table metadata to each feather, which I'd hate to lose.

Note the arrow format allows attaching custom metadata at the column (field), schema, and message level, so it should be possible to retain any metadata this way.

-Jacob

On Wed, Oct 7, 2020 at 11:38 AM Benjamin MacDonald Schmidt <[email protected]> wrote:

> Hello,
>
> Exciting project, thanks for all your work. I gather it's appropriate to
> ask a use question here? Assuming so:
>
> I have a web application that serves portions of a dataset I've broken into
> a few thousand featherV2 files structured as a quadtree. The structure
> makes heavy use of text dictionary types; I'd like to have each dictionary
> integer map to the same string across all files so that I can ship the data
> for each tile straight to GPU without decoding the text.
>
> If you slice a portion of a pandas categorical array and coerce to an arrow
> dictionary, you keep the underlying pandas integer encoding; for example,
> the last line here shows a dictionary with four keys even though the table
> has just one row.
>
> ```
> import pandas as pd
> import pyarrow as pa
> pandas_cat = pd.Series(["A", "B", "C", "B", "F"], dtype="category")
> pa.Array.from_pandas(pandas_cat[2:3])
> ```
>
> For my purposes, this is good! But of course it's wasteful, too. So I'm
> wondering:
>
> 1. Whether it's safe to count on the above code continuing to use the
> internal pandas keys in the arrow output, or whether at some point it might
> redo the pandas encoding in a more efficient way;
> 2. Whether there's a native pyarrow way to ensure that multiple feather
> dictionaries across files use the same integer identifiers for all the keys
> that they share.
>
> I can see that the right way here might be to use the IPC streaming format
> rather than feather, and send out a single schema for the dataset, with
> dictionary batches identifying the keys. But I'm also attaching table
> metadata to each feather, which I'd hate to lose.
>
> --
> Benjamin Schmidt
> Director of Digital Humanities and Clinical Associate Professor of History
> 20 Cooper Square, Room 538
> New York University
>
> benschmidt.org
