[ 
https://issues.apache.org/jira/browse/ARROW-17216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573143#comment-17573143
 ] 

Weston Pace commented on ARROW-17216:
-------------------------------------

That'd be great.  The starting point would be 
`src/arrow/compute/exec/hash_join_node.cc`.  This is where you'll find the 
check itself that is currently failing, but this is not where most of the join 
logic lives.  Fair warning: the hash-join node has been a bit of a staging 
ground for performance-critical arrow compute and so it relies on a number of 
utilities not used elsewhere.  As such, this node has a pretty high learning 
curve at the moment (though my hope is that is more diffusely spread throughout 
the engine in the future).

As of the 9.0.0 release (still pending) there are two implementations of 
hash-join.  The basic implementation (HashJoinImpl) is backed by 
std::unordered_map and can be found in src/arrow/compute/exec/hash_join.h.  A 
newer version (SwissJoin) extends HashJoinImpl and is backed by a custom hash 
map and is found in src/arrow/compute/exec/swiss_join.h.  I'd recommend testing 
and adding support to the newer version as the work required is going to be 
similar between the two.  Note that the basic version supports dictionary types 
but not the newer version (and we just fall back to the basic version if 
needed) so that is an option if the newer version proves to be trouble.

Support for types here is mostly gated by support for some of the alternate 
views/encodings used by the hash join.  One of these is a non-owning arraydata 
view called KeyColumnArray which is in src/arrow/compute/light_array.h.  This 
view does not currently supported nested data.  Note that ArraySpan is pretty 
similar (see ARROW-17257) and does support nested types (I think) so maybe it 
makes sense to tackle ARROW-17257 as part of this.

The second significant thing is RowTableImpl in 
src/arrow/compute/row/row_internal.h.  This implements a row-major encoding for 
Arrow data.  During the hash-join operation, the build data is placed into a 
table in this row-major form.  Then, during materialization, it is converted 
back to a column-major form.

On top of those two key elements there are a number of other utilities like 
ExecBatchBuilder, RowArray (which should maybe be renamed to RowTable), 
RowArrayAccessor, RowArrayMerge, the hashing utilities themselves (there are 
two versions of this too, I'm pretty sure the older implementation uses 
arrow/util/hashing.h and I know the newer version uses 
arrow/compute/exec/key_hash.h), etc.

So I would probably start by looking at the unit tests that exists for those 
utilities encodings (this reminded me that I had some unit tests I had 
forgotten to push for ARROW-17022 so I will try and get those up today) and try 
to get these utilities working with nested types.  Some of these utilities 
could probably also use some more unit tests too.  Once the utilities are 
working with nested types you can enable them for the join itself and see what 
breaks.

CC [~michalno] and [~sakras] as they are more knowledgeable in this area and 
might have some additional input / advice.

> [C++] Support joining tables with non-key fields as list
> --------------------------------------------------------
>
>                 Key: ARROW-17216
>                 URL: https://issues.apache.org/jira/browse/ARROW-17216
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jayjeet Chakraborty
>            Priority: Major
>              Labels: query-engine
>
> I am trying to join 2 Arrow tables where some columns are of {{list<float>}} 
> data type. Note that my join columns/keys are primitive data types and some 
> my non-join columns/keys are of {{{}list<float>{}}}. But, PyArrow {{join()}} 
> cannot join such as table, although pandas can. It says
> {{ArrowInvalid: Data type list<item: float> is not supported in join non-key 
> field}}
> when I execute this piece of code
> {{joined_table = table_1.join(table_2, ['k1', 'k2', 'k3'])}}
> A 
> [stackoverflow|https://stackoverflow.com/questions/73071105/listitem-float-not-supported-in-join-non-key-field]
>  response pointed out that Arrow currently cannot handle non-fixed types for 
> joins. Can this be fixed ? Or is this intentional ?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to