Re: Parallel SQL join on multivalued fields

Joel Bernstein Wed, 22 Jul 2020 05:58:17 -0700

I think the first step would be comprehensive unit tests for joins in
Parallel SQL, coupled with performance tests so we understand how
distributed performs at scale through the calcites framework. Then we know
if we can actually say joins are really supported. Then we can add the
documentation.


If join support becomes part of parallel SQL then we can actively look at
improving them.

If you want to add the unit tests I can find the time to help commit and I
can help with the performance tests.




Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Jul 21, 2020 at 5:02 AM Piero Scrima <piersc...@gmail.com> wrote:

> any suggestion on this?
> Thanks
>
> Il giorno mer 1 lug 2020 alle ore 11:22 Piero Scrima <piersc...@gmail.com>
> ha scritto:
>
>> Hi,
>>
>> I don't know if this is the right place for my question, anyway I'll try
>> to explain the issue here and understand together with you if it's worth
>> working on it.
>> I'm working with the parallel sql feature of Solr. Even though , looking
>> at the documentation, the join seems not supported, the join works. I gave
>> a look to the code and I understood that it works thanks to the calcite
>> features, (the framework on top of which is is build parellel sql feature).
>> My project doesn't need to works with big amount of data, and I think that
>> the calcite join feature can work well for my use case.
>> The problems arise when I need to join two multivalued field.
>> In parallel sql, the JOIN operation on two multivalued field seems to
>> works matching the two fields as a unique string, so that, for example , a
>> document with the join field like this :["a","b"]; will match with a
>> document with a join field exactly equal, like this: ["a","b"]; otherwise
>> if even only one element is different, they do not match.The right way to
>> do this should be first to explode the document in more document as much as
>> the number of element in the multivalued field (cross product on the field)
>> and then perform the join. I managed to solve the problem using streaming
>> expression:
>>
>> innerJoin(
>>     sort(
>>         cartesianProduct(
>>
>> search(census_defence_system,q="*:*",fl="id,defence_system,description,supplier",sort="id
>> asc",qt="/select",rows="1000"),
>>           supplier
>>         ),
>>     by="supplier asc"
>>     ),
>>     sort(
>>       cartesianProduct(
>>
>> search(census_components,q="*:*",fl="id,compoenent_name,supplier",sort="id
>> asc",qt="/select",rows="10000"),
>>         supplier
>>         ),
>>         by="supplier asc"
>>     ),
>>   on="supplier"
>> )
>> with suplier as a multivalued field. It works very well.
>> Anyway it would be great if the JOIN of multivalued field performed this
>> behavior with cartesian product also in parallel sql.
>> I think this could be a very powerful improvement, apply sql in  a not
>> normalized collection/table. It could be possible to implement this
>> feature?  I will be very glad to work on it.
>>
>> Thank you,
>>
>> Piero
>>
>

Re: Parallel SQL join on multivalued fields

Reply via email to