There's a relevant Jira issue here (maybe some others), if someone wants to pick it up and write a kernel for it:
https://issues.apache.org/jira/browse/ARROW-4097

I think having an improved experience around this dictionary
conformance/normalization problem would be valuable. Rough Python sketches of
what the re-encoding could look like are at the bottom of this message.

On Tue, May 31, 2022 at 6:24 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> I don't think you are missing anything. The parquet encoding is baked
> into the data on the disk, so re-encoding at some stage is inevitable.
> Re-encoding in Python like you are doing is going to be inefficient;
> I think you will want to do the re-encoding in C++. Unfortunately, I
> don't think we have a kernel function for this.
>
> I think one could be created. The input would be the current
> dictionary and an incoming dictionary array. The output would be a
> newly encoded array and a delta dictionary. The dictionary builder is
> already implemented in this way [1], so the logic is probably lying
> around.
>
> If such a kernel function were implemented, it would be nice to extend
> the IPC file writer to support using it internally. Then it would be
> invisible to the general Arrow user (they could just use the record
> batch file writer with this option turned on).
>
> On Tue, May 31, 2022 at 8:58 AM Niklas Bivald <niklas.biv...@enplore.com> wrote:
> >
> > Hi,
> >
> > Background:
> > I need to optimize read speed for few-column lookups in large datasets.
> > Currently I keep the data in Plasma to get fast reads, but Plasma is
> > cumbersome to manage when the data changes frequently (and it “locks”
> > the RAM). Instead I'm trying to find a fast-enough approach to reading
> > columns from an Arrow file on disk (100-200 ms range). Reading from an
> > Arrow file appears to be fast enough (even though I unfortunately have
> > string values, so zero-copy is out of the question). However, to read
> > an Arrow file I first need to generate it.
> >
> > Current solution:
> > I'm rewriting a Parquet file to an Arrow file row group by row group
> > (the dataset is bigger than RAM). Initially I had a naive implementation
> > that read a batch from the Parquet file (using pyarrow.parquet) and tried
> > to write it to a RecordBatchFileWriter. However, that quickly led to:
> >
> > pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC
> > file format. Arrow IPC files only support a single non-delta dictionary for
> > a given field across all batches.[1]
> >
> > While trying to fix this (depending on my solution) I had a question:
> >
> > 1. Is there a way, when creating the pyarrow schema, to define which
> >    categorical values exist in the dictionaries? Or to force a specific
> >    dictionary when using `pa.DictionaryArray.from_pandas`?
> >
> > Right now I use `pa.DictionaryArray.from_arrays` with the same dictionary
> > values as an array, but it's pretty cumbersome since I basically - per
> > column per row group - need to convert the column values into the indices.
> > Naive implementation:
> >
> >     def create_dictionary_array_indices(column_name, arrow_array):
> >         global dictionary_values  # mapping of column name -> dictionary values
> >         values = arrow_array.to_pylist()
> >         indices = []
> >         for i, value in enumerate(values):
> >             if not value or value != value:  # treat None/NaN as null
> >                 indices.append(None)
> >             else:
> >                 indices.append(
> >                     dictionary_values[column_name].index(value)
> >                 )
> >         indices = pd.array(indices, dtype=pd.Int32Dtype())
> >         return pa.DictionaryArray.from_arrays(
> >             indices, dictionary_values[column_name]
> >         )
> >
> > I also tried using pa.DictionaryArray.from_pandas with a Series, but even
> > though I had the same dictionary content in the Series I didn't manage to
> > get it to generate the same dictionary (it still gave "Dictionary
> > replacement detected…").
> >
> > But does this process make sense? Am I missing something? I can probably
> > speed it up (I need to figure out how to vectorize looking up indices in
> > an array), but before spending a lot of time on that I just wanted to
> > check whether the approach is sane at all. Full sample code (which works,
> > but is super slow) is at
> > https://gist.github.com/bivald/f8e0a7625af2eabbf7c5fa055da91d61
> >
> > Regards,
> > Niklas
> >
> > [1] RecordBatchStream instead works, but I got slower read times using
> > it... though I might need to redo my timings
> > [2] This is a continuation of
> > https://stackoverflow.com/questions/72438916/batch-by-batch-convert-parquet-to-arrow-with-categorical-values-arrow-ipc-files
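To make the shape of such a kernel concrete, here is a rough, untested Python
sketch of the re-encoding Weston describes. The function name is mine, and a
real C++ kernel would do this without decoding back to plain values:

    import pyarrow as pa
    import pyarrow.compute as pc

    def reencode_with_delta(current_dictionary, incoming):
        # `incoming` is a pa.DictionaryArray whose dictionary may differ from
        # `current_dictionary`. Returns (re-encoded array, delta dictionary).
        # Find incoming dictionary values the current dictionary lacks.
        already_known = pc.is_in(incoming.dictionary, value_set=current_dictionary)
        delta = incoming.dictionary.filter(pc.invert(already_known))
        combined = pa.concat_arrays([current_dictionary, delta])
        # Sketch-level re-encoding: decode to plain values, then look up each
        # value's position in the combined dictionary (nulls come back as null
        # indices, assuming the dictionary itself contains no nulls).
        decoded = incoming.dictionary.take(incoming.indices)
        new_indices = pc.index_in(decoded, value_set=combined)
        return pa.DictionaryArray.from_arrays(new_indices, combined), delta

A writer built on this could keep `combined` as its running dictionary and
emit `delta` as a delta dictionary batch instead of a replacement.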
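And for the vectorization question in the naive implementation: the per-value
Python loop can likely be replaced by a single call to
pyarrow.compute.index_in, which looks up each value's position in a fixed
dictionary. A minimal, untested sketch (the helper name is just illustrative):

    import pyarrow as pa
    import pyarrow.compute as pc

    def encode_with_fixed_dictionary(arrow_array, dictionary):
        # Index of each value in `dictionary`; nulls (and values missing
        # from the dictionary) come back as null indices.
        indices = pc.index_in(arrow_array, value_set=dictionary)
        return pa.DictionaryArray.from_arrays(indices, dictionary)

    # Per column, per record batch, something like:
    #   encoded = encode_with_fixed_dictionary(batch.column(column_name),
    #                                          dictionary_values[column_name])

Because the same dictionary is passed in for every row group, the resulting
columns should satisfy the single-dictionary requirement of the IPC file
writer.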