Hi,

Background: I need to optimize read speed for few-column lookups in large datasets. Currently I keep the data in Plasma to get fast reads, but Plasma is cumbersome to manage when the data changes frequently (and it "locks" the RAM). Instead I'm trying to find a fast-enough way to read columns from an Arrow file on disk (in the 100-200 ms range). Reading from an Arrow file appears to be fast enough (even though I unfortunately have string values, so zero-copy is out of the question). However, to read an Arrow file I first need to generate it.
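For reference, the read path I'm timing looks roughly like this (just a sketch; "lookup.arrow" and the column names are placeholders):

>>> import pyarrow as pa
>>>
>>> # Memory-map the IPC file and pull out only the columns needed for the lookup
>>> with pa.memory_map("lookup.arrow", "r") as source:
>>>     table = pa.ipc.open_file(source).read_all()
>>>     subset = table.select(["key", "value"])   # placeholder column names
>>>     df = subset.to_pandas()                   # string columns get copied here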
Current solution: I'm rewriting a Parquet file to an Arrow file one row group at a time (the dataset is bigger than RAM). Initially I had a naive implementation that read a batch from the Parquet file (using pyarrow.parquet) and wrote it to a RecordBatchFileWriter. However, that quickly led to:

    pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches. [1]

While trying to fix this (depending on my solution) I ran into a question:

1. Is there a way, when creating the pyarrow schema, to define which categorical values exist in the dictionaries? Or to force a specific dictionary when using pa.DictionaryArray.from_pandas?

Right now I use pa.DictionaryArray.from_arrays with the same dictionary values as an array, but it's pretty cumbersome since I basically need to convert the column values into indices, per column per row group. Naive implementation:

>>> import pandas as pd
>>> import pyarrow as pa
>>>
>>> def create_dictionary_array_indices(column_name, arrow_array):
>>>     global dictionary_values
>>>     values = arrow_array.to_pylist()
>>>     indices = []
>>>     for value in values:
>>>         # Treat missing/empty values and NaN (value != value) as nulls
>>>         if not value or value != value:
>>>             indices.append(None)
>>>         else:
>>>             indices.append(
>>>                 dictionary_values[column_name].index(value)
>>>             )
>>>     indices = pd.array(indices, dtype=pd.Int32Dtype())
>>>     return pa.DictionaryArray.from_arrays(
>>>         indices, dictionary_values[column_name]
>>>     )

I also tried pa.DictionaryArray.from_pandas with a Series, but even though the Series had the same dictionary content I didn't manage to get it to generate the same dictionary (it still gave "Dictionary replacement detected…").

But does this process make sense? Am I missing something? I can probably speed it up (I need to figure out how to vectorize looking up the indices; I sketched one idea in the P.S. below), but before spending a lot of time on that I wanted to check whether the approach is sane at all.

Full sample code (which works, but is super slow): https://gist.github.com/bivald/f8e0a7625af2eabbf7c5fa055da91d61

Regards,
Niklas

[1] Using a RecordBatchStreamWriter (the IPC stream format) works instead, but I got slower read times with it... although I might need to redo my timings.
[2] This is a continuation of https://stackoverflow.com/questions/72438916/batch-by-batch-convert-parquet-to-arrow-with-categorical-values-arrow-ipc-files
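P.S. The vectorized variant I have in mind (an untested sketch: it lets pd.Categorical map the values to codes against the fixed category list in one shot, and assumes the same global dictionary_values dict as above):

>>> import pandas as pd
>>> import pyarrow as pa
>>>
>>> def create_dictionary_array_indices_vectorized(column_name, arrow_array):
>>>     global dictionary_values
>>>     categories = dictionary_values[column_name]
>>>     # pd.Categorical assigns each value the index of its category,
>>>     # and -1 to anything missing or not in the category list
>>>     codes = pd.Categorical(arrow_array.to_pandas(), categories=categories).codes
>>>     indices = pd.array(codes, dtype=pd.Int32Dtype())
>>>     indices[codes == -1] = pd.NA  # turn the -1 codes into nulls
>>>     return pa.DictionaryArray.from_arrays(indices, categories)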