Hello,

I'd like to try to contribute a fix for *reading* IPC streams in C++ (and pyarrow) where multiple columns share the same dictionary (leaving write support for future work, but not too far behind). See below (originally sent to user@) for some context. Although the original query talks only about writing, reading doesn't work either.
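To make the read-side goal concrete, here's a minimal pyarrow sketch of what I'd like to work end-to-end. The file name (and the Java-side writer that produced it) is hypothetical; today this path doesn't work, and with the local patch described below it does:

    import pyarrow as pa

    # Hypothetical input: an IPC stream written from Java in which the
    # dictionary-encoded columns 'a' and 'b' both reference dictionary id 0.
    with open("shared_dicts.arrows", "rb") as f:
        reader = pa.ipc.open_stream(f)
        table = reader.read_all()

    # Today this trips over the shared dictionary id; with the local patch,
    # both columns come back as pandas categoricals sharing the same
    # category set.
    df = table.to_pandas()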
I've played around with a local patch that seems adequate - i.e. it can read IPC streams with shared dicts that were generated in Java, and they come out as the appropriate categoricals in pandas.

The advantage of supporting only read right now is that it should require very few changes - and work completely transparently - whereas write is a bit trickier, the public interfaces currently not being set up for it (I might be mistaken about this). For my own purposes, read is also currently sufficient (as I can just write from Java in production). The disadvantage is that we'd probably need an arrow/testing/data file for now to test this, and can't use the roundtrip yet.

Given the above,
- does it sound sensible to contribute only read for now, or should we aim wider and do write as well?
- should this be a new JIRA, or fall under https://issues.apache.org/jira/browse/ARROW-5340 (e.g. as a subtask, if you use those)?

(I expect to find all useful administrative info in https://github.com/apache/arrow/blob/master/docs/source/developers/contributing.rst, but do let me know if there are other handy resources.)

-J

---------- Forwarded message ---------
From: Joris Peeters <joris.mg.peet...@gmail.com>
Date: Fri, Feb 26, 2021 at 10:11 AM
Subject: Re: pyarrow: write table where columns share the same dictionary
To: <u...@arrow.apache.org>

FWIW, in the Java client it's
https://github.com/apache/arrow/blob/apache-arrow-3.0.0/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamReader.java#L131
that's causing the aforementioned stack overflow when reading lots of dictionaries from a stream, i.e. the recursive construct

    public boolean loadNextBatch() throws IOException {
      ..
      if (..) {
        return true;
      } else {
        ..
        return loadNextBatch();
      }
    }

Not sure if that qualifies as a bug, as I think the depth is typically multiple thousands, but perhaps of interest.

On Thu, Feb 25, 2021 at 8:11 PM Wes McKinney <wesmck...@gmail.com> wrote:

> I'm not sure if it's possible at the moment, but it SHOULD be made
> possible. See ARROW-5340
>
> On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
> <joris.mg.peet...@gmail.com> wrote:
> >
> > Hello,
> >
> > I have a pandas DataFrame with many string columns (>30,000), and they
> > share a low-cardinality set of values (e.g. size 100). I'd like to convert
> > this to an Arrow table of dictionary-encoded columns (let's say int16 for
> > the index cols), but with just one shared dictionary of strings.
> > This is to avoid ending up with >30,000 tiny dictionaries on the wire,
> > which doesn't even load in e.g. Java (due to a stack overflow error).
> >
> > Despite my efforts, I haven't really been able to achieve this with the
> > public APIs I could find. Does anyone have an idea? I'm using pyarrow
> > 3.0.0.
> >
> > For a mickey mouse example, I'm looking at e.g.
> >
> > df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
> >
> > and would like a Table with dictionary-encoded columns a and b, both
> > nullable, that both refer to the same dictionary with id=0 (or whatever id)
> > containing ['foo', 'bar', 'quux'].
> >
> > Thanks,
> > -Joris.