[
https://issues.apache.org/jira/browse/ARROW-17216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573143#comment-17573143
]
Weston Pace commented on ARROW-17216:
-------------------------------------
That'd be great. The starting point would be
`src/arrow/compute/exec/hash_join_node.cc`. This is where you'll find the
check itself that is currently failing, but this is not where most of the join
logic lives. Fair warning: the hash-join node has been a bit of a staging
ground for performance-critical arrow compute and so it relies on a number of
utilities not used elsewhere. As such, this node has a pretty high learning
curve at the moment (though my hope is that is more diffusely spread throughout
the engine in the future).
As of the 9.0.0 release (still pending) there are two implementations of
hash-join. The basic implementation (HashJoinImpl) is backed by
std::unordered_map and can be found in src/arrow/compute/exec/hash_join.h. A
newer version (SwissJoin) extends HashJoinImpl and is backed by a custom hash
map and is found in src/arrow/compute/exec/swiss_join.h. I'd recommend testing
and adding support to the newer version as the work required is going to be
similar between the two. Note that the basic version supports dictionary types
but not the newer version (and we just fall back to the basic version if
needed) so that is an option if the newer version proves to be trouble.
Support for types here is mostly gated by support for some of the alternate
views/encodings used by the hash join. One of these is a non-owning arraydata
view called KeyColumnArray which is in src/arrow/compute/light_array.h. This
view does not currently supported nested data. Note that ArraySpan is pretty
similar (see ARROW-17257) and does support nested types (I think) so maybe it
makes sense to tackle ARROW-17257 as part of this.
The second significant thing is RowTableImpl in
src/arrow/compute/row/row_internal.h. This implements a row-major encoding for
Arrow data. During the hash-join operation, the build data is placed into a
table in this row-major form. Then, during materialization, it is converted
back to a column-major form.
On top of those two key elements there are a number of other utilities like
ExecBatchBuilder, RowArray (which should maybe be renamed to RowTable),
RowArrayAccessor, RowArrayMerge, the hashing utilities themselves (there are
two versions of this too, I'm pretty sure the older implementation uses
arrow/util/hashing.h and I know the newer version uses
arrow/compute/exec/key_hash.h), etc.
So I would probably start by looking at the unit tests that exists for those
utilities encodings (this reminded me that I had some unit tests I had
forgotten to push for ARROW-17022 so I will try and get those up today) and try
to get these utilities working with nested types. Some of these utilities
could probably also use some more unit tests too. Once the utilities are
working with nested types you can enable them for the join itself and see what
breaks.
CC [~michalno] and [~sakras] as they are more knowledgeable in this area and
might have some additional input / advice.
> [C++] Support joining tables with non-key fields as list
> --------------------------------------------------------
>
> Key: ARROW-17216
> URL: https://issues.apache.org/jira/browse/ARROW-17216
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jayjeet Chakraborty
> Priority: Major
> Labels: query-engine
>
> I am trying to join 2 Arrow tables where some columns are of {{list<float>}}
> data type. Note that my join columns/keys are primitive data types and some
> my non-join columns/keys are of {{{}list<float>{}}}. But, PyArrow {{join()}}
> cannot join such as table, although pandas can. It says
> {{ArrowInvalid: Data type list<item: float> is not supported in join non-key
> field}}
> when I execute this piece of code
> {{joined_table = table_1.join(table_2, ['k1', 'k2', 'k3'])}}
> A
> [stackoverflow|https://stackoverflow.com/questions/73071105/listitem-float-not-supported-in-join-non-key-field]
> response pointed out that Arrow currently cannot handle non-fixed types for
> joins. Can this be fixed ? Or is this intentional ?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)