There's a relevant Jira issue here (and maybe some others), if someone
wants to pick it up and write a kernel for it:

https://issues.apache.org/jira/browse/ARROW-4097

I think having an improved experience around this dictionary
conformance/normalization problem would be valuable.

On Tue, May 31, 2022 at 6:24 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> I don't think you are missing anything.  The parquet encoding is baked
> into the data on the disk so re-encoding at some stage is inevitable.
> Re-encoding in python like you are doing is going to be inefficient.
> I think you will want to do the re-encoding in C++.  Unfortunately, I
> don't think we have a kernel function for this.
>
> I think one could be created.  The input would be the current
> dictionary and an incoming dictionary array.  The output would be a
> newly encoded array and a delta dictionary.  The dictionary builder is
> already implemented in this way[1] so the logic is probably lying
> around.
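>
> For illustration, here is a rough Python-level sketch of the re-encoding
> such a kernel might perform, written against existing compute functions
> (index_in, filter, take, concat_arrays). The name reencode_against is made
> up and this is an untested sketch of the idea, not the proposed C++ kernel:
>
>     import pyarrow as pa
>     import pyarrow.compute as pc
>
>     def reencode_against(current_dictionary, incoming):
>         # current_dictionary: pa.Array of values already written for this
>         # field; incoming: pa.DictionaryArray from the next batch.
>         incoming_values = incoming.dictionary
>         # Values the current dictionary doesn't have yet -> the delta.
>         is_new = pc.is_null(
>             pc.index_in(incoming_values, value_set=current_dictionary))
>         delta = incoming_values.filter(is_new)
>         combined = pa.concat_arrays([current_dictionary, delta])
>         # Decode the incoming array and re-map it onto the combined
>         # dictionary, so every batch shares one dictionary plus deltas.
>         decoded = incoming.dictionary.take(incoming.indices)
>         new_indices = pc.index_in(decoded, value_set=combined)
>         return pa.DictionaryArray.from_arrays(new_indices, combined), delta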
>
> If such a kernel function were implemented it would be nice to extend
> the IPC file writer to support using it internally.  Then it would be
> invisible to the general Arrow user (they could just use the record
> batch file writer with this option turned on).
>
> On Tue, May 31, 2022 at 8:58 AM Niklas Bivald <niklas.biv...@enplore.com> 
> wrote:
> >
> > Hi,
> >
> > Background:
> > I need to optimize read speed for few-column lookups in large datasets.
> > Currently I keep the data in Plasma for fast reads, but Plasma is
> > cumbersome to manage when the data changes frequently (and it “locks” the
> > RAM). Instead I’m trying to find a fast-enough approach to reading
> > columns from an Arrow file on disk (100-200 ms range). Reading
> > from an Arrow file appears to be fast enough (even though I unfortunately
> > have string values so zero-copy is out of the question). However, to read
> > an arrow file I first need to generate it.
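> >
> > For context, the read path I have in mind looks roughly like this
> > (untested sketch; the file name and column names are placeholders):
> >
> >     import pyarrow as pa
> >
> >     # Memory-map the generated Arrow file and pull out only the columns
> >     # needed for the lookup.
> >     with pa.memory_map("lookup.arrow", "r") as source:
> >         table = pa.ipc.open_file(source).read_all()
> >         subset = table.select(["user_id", "country"])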
> >
> > Current solution:
> > I’m re-writing a parquet file to an Arrow file row group by row group
> > (the dataset is bigger than RAM). Initially I had a naive implementation
> > that read a batch from the parquet file (using pyarrow.parquet) and tried
> > to write it to a RecordBatchFileWriter. However, that quickly led to:
> >
> > pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC
> > file format. Arrow IPC files only support a single non-delta dictionary for
> > a given field across all batches.[1]
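> >
> > The failing loop was roughly the following (untested sketch; the file
> > names are placeholders). Assuming the string columns come back
> > dictionary-encoded (e.g. they were written from pandas categoricals, or
> > read with read_dictionary=[...]), every row group carries its own
> > dictionary, so writing the second batch raises the error above:
> >
> >     import pyarrow as pa
> >     import pyarrow.parquet as pq
> >
> >     pf = pq.ParquetFile("input.parquet")
> >     with pa.OSFile("output.arrow", "wb") as sink:
> >         writer = pa.RecordBatchFileWriter(sink, pf.schema_arrow)
> >         for batch in pf.iter_batches():
> >             # Each batch's dictionary columns have their own dictionary,
> >             # which the IPC file writer rejects as a "replacement".
> >             writer.write_batch(batch)
> >         writer.close()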
> >
> > Trying to fix this (depending on my solution) I had a question:
> >
> >
> >    1. Is there a way, when creating the pyarrow schema, to define which
> >    categorical values exist in the dictionaries? Or to force a specific
> >    dictionary when using `pa.DictionaryArray.from_pandas`?
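> >
> > (The closest I’ve found is declaring the dictionary type in the schema;
> > as far as I can tell this fixes the index and value types but not the
> > dictionary values themselves. “country” is a placeholder column name:)
> >
> >     import pyarrow as pa
> >
> >     # Declares that the field is dictionary-encoded with int32 indices
> >     # and string values; the value list itself is not part of the schema.
> >     schema = pa.schema([
> >         pa.field("country", pa.dictionary(pa.int32(), pa.string())),
> >     ])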
> >
> >
> > Right now I use `pa.DictionaryArray.from_arrays` with the same dictionary
> > values as an array, but it’s pretty cumbersome since I basically - per
> > column per row group - need to convert the column values into the indices.
> > Naive implementation:
> >
> > import pandas as pd
> > import pyarrow as pa
> >
> > def create_dictionary_array_indices(column_name, arrow_array):
> >     # dictionary_values maps column name -> list of every category value,
> >     # shared across all row groups so each batch gets the same dictionary.
> >     global dictionary_values
> >     values = arrow_array.to_pylist()
> >     indices = []
> >     for value in values:
> >         # Treat None and NaN (value != value) as nulls.
> >         if not value or value != value:
> >             indices.append(None)
> >         else:
> >             indices.append(dictionary_values[column_name].index(value))
> >     indices = pd.array(indices, dtype=pd.Int32Dtype())
> >     return pa.DictionaryArray.from_arrays(
> >         indices, dictionary_values[column_name]
> >     )
> >
> > I also tried using pa.DictionaryArray.from_pandas with a pandas Series,
> > but even though the Series had the same dictionary content, I didn’t
> > manage to get it to generate the same dictionary (it still gave
> > "Dictionary replacement detected…").
> >
> > But does this process make sense? Am I missing something? I can probably
> > speed it up (I need to figure out how to vectorize looking up indices in
> > an array), but before spending a lot of time on that I just wanted to
> > check whether the approach is sane at all. Full sample code (which works,
> > but is super slow) is at
> > https://gist.github.com/bivald/f8e0a7625af2eabbf7c5fa055da91d61
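> >
> > A vectorized variant of the same lookup might look roughly like this,
> > using pyarrow.compute.index_in instead of the per-value Python loop
> > (untested sketch, with dictionary_values as above):
> >
> >     import pyarrow as pa
> >     import pyarrow.compute as pc
> >
> >     def create_dictionary_array_indices_vectorized(column_name, arrow_array):
> >         # Look up every value's position in the shared dictionary in one
> >         # call; values missing from it (and nulls) become null indices.
> >         dictionary = pa.array(dictionary_values[column_name])
> >         indices = pc.index_in(arrow_array, value_set=dictionary)
> >         return pa.DictionaryArray.from_arrays(indices, dictionary)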
> >
> > Regards,
> > Niklas
> >
> > [1] Using the RecordBatchStreamWriter (IPC stream format) instead works,
> > but I got slower read times with it... I might need to redo my timings,
> > though.
> > [2] This is a continuation of
> > https://stackoverflow.com/questions/72438916/batch-by-batch-convert-parquet-to-arrow-with-categorical-values-arrow-ipc-files
