Hi,

Background:
I need to optimize read speed for few-column lookups in large
datasets. Currently I keep the data in Plasma for fast reads, but
Plasma is cumbersome to manage when the data changes frequently (and it
“locks” the RAM). Instead I’m trying to find a fast-enough approach to
reading columns from an Arrow file on disk (in the 100-200 ms range).
Reading from an Arrow file appears to be fast enough (even though I
unfortunately have string values, so zero-copy is out of the question).
However, to read an Arrow file I first need to generate it.
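
For reference, this is roughly the read path I’m aiming for (the file
name and column names below are placeholders): memory-map the IPC file
and pull out only the columns I need.

import pyarrow as pa

# Memory-map the Arrow IPC file and read only a couple of columns;
# with string columns the data still gets copied, but reading this way
# appears to be fast enough for my use case
with pa.memory_map("dataset.arrow", "r") as source:
    reader = pa.ipc.open_file(source)
    table = reader.read_all().select(["id", "value"])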

Current solution:
I’m re-writing a parquet file to an Arrow file, row group by row group
(the dataset is bigger than RAM). Initially I had a naive
implementation that read a batch from the parquet file (using
pyarrow.parquet) and wrote it to a RecordBatchFileWriter. However, that
quickly led to:

pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC
file format. Arrow IPC files only support a single non-delta dictionary for
a given field across all batches.[1]
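
For reference, the conversion loop looks roughly like this (paths are
placeholders and this is a simplified sketch of the real code):

import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("dataset.parquet")
with pa.OSFile("dataset.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, pf.schema_arrow) as writer:
        for i in range(pf.num_row_groups):
            # One row group at a time, since the dataset is bigger than RAM
            table = pf.read_row_group(i)
            for batch in table.to_batches():
                # Dictionary-encoded (categorical) columns get a fresh
                # dictionary per row group, which is what triggers the
                # "Dictionary replacement detected" error above
                writer.write_batch(batch)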

While trying to fix this (depending on which solution I go with) I had
a question:


   1. Is there a way, when creating the pyarrow schema, to define which
   categorical values exist in the dictionaries? Or to force a specific
   dictionary when using `pa.DictionaryArray.from_pandas`? (See the
   schema sketch below for what I mean.)
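
For context, this is how I can declare a dictionary-typed field in the
schema today (field names are made up); as far as I can tell it only
pins the index and value types, not the actual dictionary values, which
is what the question is about:

import pyarrow as pa

schema = pa.schema([
    # Declares the column as dictionary-encoded (int32 indices into
    # string values), but says nothing about which values exist
    ("category_col", pa.dictionary(pa.int32(), pa.string())),
    ("value", pa.float64()),
])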


Right now I use `pa.DictionaryArray.from_arrays` with the same
dictionary values as an array, but it’s pretty cumbersome since I
basically need to convert the column values into indices, per column
per row group. Naive implementation:

import pandas as pd
import pyarrow as pa

# dictionary_values maps column name -> list of all possible values
def create_dictionary_array_indices(column_name, arrow_array):
    global dictionary_values
    values = arrow_array.to_pylist()
    indices = []
    for value in values:
        # Treat None/empty strings and NaN (value != value) as nulls
        if not value or value != value:
            indices.append(None)
        else:
            # Position of the value in the shared, fixed dictionary
            indices.append(dictionary_values[column_name].index(value))
    indices = pd.array(indices, dtype=pd.Int32Dtype())
    return pa.DictionaryArray.from_arrays(
        indices, dictionary_values[column_name]
    )

I also tried using pa.DictionaryArray.from_pandas with a Series, but
even though the Series had the same dictionary content I didn’t manage
to get it to generate the same Dictionary (it still gave "Dictionary
replacement detected…").

But does this process make sense? Am I missing something? I can
probably speed it up (I need to figure out how to vectorize looking up
indices in an array; one idea is sketched below), but before spending a
lot of time on that I wanted to check whether the approach is sane at
all. Full sample code (that works, but is super slow) at
https://gist.github.com/bivald/f8e0a7625af2eabbf7c5fa055da91d61
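
For the vectorization, one idea I’m considering (a sketch, not yet
tested against the full dataset) is to let pandas do the value-to-index
mapping through a Categorical with fixed categories:

import pandas as pd
import pyarrow as pa

def create_dictionary_array_vectorized(column_name, arrow_array):
    # Map values to codes against the fixed category list in one go;
    # values not in the list (and NaN) become missing automatically
    cat = pd.Categorical(
        arrow_array.to_pandas(),
        categories=dictionary_values[column_name],
    )
    # pyarrow converts a categorical Series to a DictionaryArray whose
    # dictionary is the full category list, so every row group should
    # end up with an identical dictionary
    return pa.array(pd.Series(cat))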

Regards,
Niklas

[1] Writing with the IPC stream format (RecordBatchStreamWriter)
instead works, but I got slower read times using it... I might need to
redo my timings though
[2] This is a continuation of
https://stackoverflow.com/questions/72438916/batch-by-batch-convert-parquet-to-arrow-with-categorical-values-arrow-ipc-files
