Hello,

I'd like to try to contribute a fix for *reading* IPC streams in C++ (and pyarrow) where multiple columns share the same dictionary (leaving write support for future work, but not too far behind). See below (originally sent to user@) for some context. Although the original query talks only about writing, reading doesn't work either.
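To make the read-side goal concrete, here's a minimal pyarrow sketch of what I'd like to work end-to-end. The file name (and the Java-side writer that produced it) is hypothetical; today this path doesn't work, and with the local patch described below it does:

    import pyarrow as pa

    # Hypothetical input: an IPC stream written from Java in which the
    # dictionary-encoded columns 'a' and 'b' both reference dictionary id 0.
    with open("shared_dicts.arrows", "rb") as f:
        reader = pa.ipc.open_stream(f)
        table = reader.read_all()

    # Today this trips over the shared dictionary id; with the local patch,
    # both columns come back as pandas categoricals sharing the same
    # category set.
    df = table.to_pandas()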
I've played around with a local patch that seems adequate - i.e. it can read IPC streams with shared dicts that were generated in Java, and they come out as the appropriate categoricals in pandas.

The advantage of supporting only read right now is that it should require very few changes - and work completely transparently - whereas write is a bit trickier, the public interfaces currently not being set up for it (I might be mistaken about this). For my own purposes, read is also currently sufficient (as I can just write from Java in production). The disadvantage is that we'd probably need an arrow/testing/data file for now to test this, and can't use the roundtrip yet.

Given the above,
- does it sound sensible to contribute only read for now, or should we aim wider and do write as well?
- should this be a new JIRA, or fall under https://issues.apache.org/jira/browse/ARROW-5340 (e.g. as a subtask, if you use those)?

(I expect to find all useful administrative info in https://github.com/apache/arrow/blob/master/docs/source/developers/contributing.rst, but do let me know if there are other handy resources.)

-J

---------- Forwarded message ---------
From: Joris Peeters <joris.mg.peet...@gmail.com>
Date: Fri, Feb 26, 2021 at 10:11 AM
Subject: Re: pyarrow: write table where columns share the same dictionary
To: <u...@arrow.apache.org>

FWIW, in the Java client it's
https://github.com/apache/arrow/blob/apache-arrow-3.0.0/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamReader.java#L131
that's causing the aforementioned stack overflow when reading lots of dictionaries from a stream, i.e. the recursive construct

    public boolean loadNextBatch() throws IOException {
      ..
      if (..) {
        return true;
      } else {
        ..
        return loadNextBatch();
      }
    }

Not sure if that qualifies as a bug, as I think the depth is typically multiple thousands, but perhaps of interest.

On Thu, Feb 25, 2021 at 8:11 PM Wes McKinney <wesmck...@gmail.com> wrote:

> I'm not sure if it's possible at the moment, but it SHOULD be made
> possible. See ARROW-5340
>
> On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
> <joris.mg.peet...@gmail.com> wrote:
> >
> > Hello,
> >
> > I have a pandas DataFrame with many string columns (>30,000), and they
> > share a low-cardinality set of values (e.g. size 100). I'd like to convert
> > this to an Arrow table of dictionary-encoded columns (let's say int16 for
> > the index cols), but with just one shared dictionary of strings.
> > This is to avoid ending up with >30,000 tiny dictionaries on the wire,
> > which doesn't even load in e.g. Java (due to a stack overflow error).
> >
> > Despite my efforts, I haven't really been able to achieve this with the
> > public APIs I could find. Does anyone have an idea? I'm using pyarrow
> > 3.0.0.
> >
> > For a mickey mouse example, I'm looking at e.g.
> >
> > df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
> >
> > and would like a Table with dictionary-encoded columns a and b, both
> > nullable, that both refer to the same dictionary with id=0 (or whatever id)
> > containing ['foo', 'bar', 'quux'].
> >
> > Thanks,
> > -Joris.