Hello,

Exciting project, and thanks for all your work. I gather it's appropriate to ask a usage question here? Assuming so:

I have a web application that serves portions of a dataset I've broken into a few thousand Feather V2 files structured as a quadtree. The structure makes heavy use of dictionary-encoded text columns; I'd like each dictionary integer to map to the same string across all files so that I can ship the data for each tile straight to the GPU without decoding the text.

If you slice a portion of a pandas categorical array and coerce it to an Arrow dictionary, you keep the underlying pandas integer encoding; for example, the last line here yields a dictionary with four keys even though the sliced array has just one element:

```
import pandas as pd
import pyarrow as pa

# Four categories (A, B, C, F) underlie the full series.
pandas_cat = pd.Series(["A", "B", "C", "B", "F"], dtype="category")

# The one-element slice still carries all four dictionary keys.
pa.Array.from_pandas(pandas_cat[2:3])
```

For my purposes, this is good! But of course it's wasteful, too. So I'm wondering:

1. Whether it's safe to count on the above code continuing to use the internal pandas keys in the Arrow output, or whether at some point it might redo the pandas encoding in a more efficient way;
2. Whether there's a native pyarrow way to ensure that dictionaries across multiple Feather files use the same integer identifiers for all the keys they share (a sketch of the manual workaround I have in mind follows this list).
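For what it's worth, the only workaround I can see is to fix a global vocabulary up front and encode every tile against it myself; a minimal sketch, where GLOBAL_DICT, write_tile, and the file names are placeholders for my real pipeline:

```
import pyarrow as pa
import pyarrow.feather as feather

# Hypothetical fixed global vocabulary; in the real dataset this would be
# the union of every string in the column, in a stable order.
GLOBAL_DICT = pa.array(["A", "B", "C", "F"])
CODE_FOR = {s.as_py(): i for i, s in enumerate(GLOBAL_DICT)}

def write_tile(strings, path):
    # Look up each string's fixed global code, so that index i means the
    # same string in every file the web app serves.
    indices = pa.array([CODE_FOR[s] for s in strings], type=pa.int32())
    labels = pa.DictionaryArray.from_arrays(indices, GLOBAL_DICT)
    feather.write_feather(pa.table({"label": labels}), path)

# Two tiles with different contents nevertheless share integer codes.
write_tile(["C"], "tile_0.feather")
write_tile(["A", "F"], "tile_1.feather")
```

That works, but it means maintaining the vocabulary and the encoding step entirely outside pyarrow, which is why I'm hoping there's something built in.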
I can see that the right way here might be to use the IPC streaming format rather than Feather, and send out a single schema for the dataset, with dictionary batches identifying the keys. But I'm also attaching table metadata to each Feather file, which I'd hate to lose (a small example is in the P.S. below).

--
Benjamin Schmidt
Director of Digital Humanities and Clinical Associate Professor of History
20 Cooper Square, Room 538
New York University
benschmidt.org
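P.S. For concreteness, here's the kind of per-file metadata attachment I'd hate to lose (the key and value are stand-ins for my real metadata); as far as I can tell, a single IPC stream carries just one schema message, so only one such metadata blob for the whole dataset:

```
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"label": pa.array(["C"]).dictionary_encode()})

# Per-tile metadata of the sort I attach today; the key and value here
# are just stand-ins for the real thing.
table = table.replace_schema_metadata({b"extent": b"x:[0,1], y:[0,1]"})
feather.write_feather(table, "tile_0.feather")

# The metadata round-trips with the individual file.
print(feather.read_table("tile_0.feather").schema.metadata)
```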