Hi Jörg,

Thank you for the great work so far! Discussing here is great; JIRA also
works, but I think it usually gets a bit more direct and personal/friendly
on the mailing list, which I like ;-)

Hmm, yes, the whole BaseObject thing is kinda nasty. It's one of those
things I'd like to get rid of, but IMO it would break a couple of dependent
projects, so it should happen as part of a major version bump. It's not
that we don't want to do it.

Regarding 4: which arraycopy invocation are you referring to? Something in
FilterItem, presumably? I couldn't find it.

I see that some of the unit tests are breaking on your branch. Bringing
them back to green would be the minimal requirement from my side, at least.

I saw something that could be optimized: your loading of rows into a
List<Row>. There is already a DataSet.toRows() method for that, which will
in some cases be much faster (and consume less memory) than iterating
through the dataset yourself. It also has the benefit of being the same
method call for both in-memory and other datasets.
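
For example, a minimal sketch (the dataSet variable is just for
illustration):

    // instead of the manual loop:
    //     final List<Row> rows = new ArrayList<>();
    //     while (dataSet.next()) { rows.add(dataSet.getRow()); }
    // let the DataSet materialize itself, which can short-circuit
    // for in-memory implementations:
    final List<Row> rows = dataSet.toRows();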

A note on your "instanceof InMemoryDataSet" checks: actually, we quite
often wrap datasets on top of other datasets, so you will probably want to
check for that instead. You could add a method like this, maybe:

    public static boolean isInMemory(final DataSet ds) {
        DataSet childDataSet = ds;
        while (true) {
            if (childDataSet instanceof InMemoryDataSet) {
                return true;
            }
            if (!(childDataSet instanceof WrappingDataSet)) {
                return false;
            }
            // unwrap one level of decoration and check again
            childDataSet = ((WrappingDataSet) childDataSet).getWrappedDataSet();
        }
    }


That will make your checks catch more scenarios, e.g. where an
InMemoryDataSet is decorated by a FilteredDataSet, a MaxRowsDataSet or the
like.
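
Your call sites could then become something like this (just a sketch;
dataSet is whatever you are about to join on):

    if (isInMemory(dataSet)) {
        // safe to rewind or build an index over the rows
    } else {
        // stream the rows in a single pass instead
    }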

Other than that, keep up the great work. Let me know if I can help. It
looks like you simply need to work a bit more on the corner cases found by
the unit tests etc.

BR,
Kasper


2016-05-03 6:09 GMT-07:00 Jörg Unbehauen <
[email protected]>:

> Hi all,
>
> I just pushed to https://github.com/tomatophantastico/metamodel , into
> the accelerateCarthesianProduct branch.
>
> The code is clearly not production-ready, but you might want to look at it.
>
> Let me explain:
>
> 1. I added a test that computes the join between an employee and a
> department table, each with 10,000 tuples. (Materializing the full cross
> product means 10,000 x 10,000 = 100,000,000 intermediate rows.)
>
> This test terminates with a heap overflow on my machine.
>
> 2. I implemented all the points below, which brought the execution time
> down to 100 seconds.
>
> Profiling the app, I discovered two things:
>
> 3. The BaseObject.hashCode() implementation was now causing a lot of
> load, as I was constantly putting/getting objects into/from hash maps. I
> overrode it just out of curiosity, which improved performance a lot.
> After doing point 2, however, this was no longer necessary.
>
> 4. System.arraycopy was also causing a lot of load, so I changed the
> visibility of FilterItem.compare() in order to copy only the tuples
> matching the filter items.
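>
> Roughly, the idea is this (a sketch with a hypothetical helper, not the
> actual diff):
>
>     // reject a candidate tuple before paying for its arraycopy;
>     // FilterItem.evaluate(Row) does the actual comparison
>     private static boolean matchesAll(final Row row,
>             final Iterable<FilterItem> whereItems) {
>         for (final FilterItem whereItem : whereItems) {
>             if (!whereItem.evaluate(row)) {
>                 return false;
>             }
>         }
>         return true;
>     }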
>
> BTW, where do you usually have these discussions? JIRA, GitHub or this
> mailing list?
>
> Best,
>
> Jörg
>
>
>
>
>
> On 03.05.16 at 12:06, Jörg Unbehauen wrote:
>
> Hi Kasper,
>>
>> I'd happily contribute. Actually, I already started working on it, but I
>> soon discovered that there might be a lot of side effects.
>>
>> So the basic ideas I had were:
>>
>> 1. Start with in-memory datasets.
>>
>> 2. Stream non-rewindable datasets.
>>
>> 3. Directly apply filters on every row created.
>>
>> 4. Join first between datasets with filters, in order to prevent
>> Cartesian products.
>>
>> And further:
>>
>> 5. When joining, avoid nested loops; instead, build indexes for the
>> in-memory datasets (see the sketch below). Not sure about this one,
>> though.
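>>
>> Roughly what I have in mind for 5. (just a sketch; joinColumn,
>> otherJoinColumn and the dataset variables are hypothetical):
>>
>>     // build a hash index over the join key of the in-memory side
>>     final Map<Object, List<Row>> index = new HashMap<>();
>>     for (final Row row : smallDataSet.toRows()) {
>>         final Object key = row.getValue(joinColumn);
>>         index.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
>>     }
>>
>>     // probe the index while streaming the other side, instead of a
>>     // nested loop over all combinations
>>     while (largeDataSet.next()) {
>>         final Row probe = largeDataSet.getRow();
>>         final List<Row> matches = index.get(probe.getValue(otherJoinColumn));
>>         if (matches != null) {
>>             // combine probe with each matching row
>>         }
>>     }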
>>
>> Feedback is appreciated.
>>
>> I already started a fork (https://github.com/tomatophantastico/metamodel).
>> As soon as it works, I'll write an email again before any pull requests.
>>
>> Best,
>>
>> Jörg
>>
>>
>> On 02.05.16 at 06:56, Kasper Sørensen wrote:
>>
>>> Hi Jörg,
>>>
>>> You're right about the very naive behaviour of that method. It could
>>> _certainly_ use an optimization or two. I can only speak for myself, but
>>> I
>>> just never used MetaModel much for joins and thus never gave it much
>>> thought. Looking at the code I'm thinking that we can do much better.
>>>
>>> Would you be interested in working on improving this? If so, I will
>>> happily share insights and ideas on how we can pull it off.
>>>
>>> Cheers,
>>> Kasper
>>>
>>> 2016-05-01 4:12 GMT-07:00 Jörg Unbehauen <
>>> [email protected]>:
>>>
>>> Hi all,
>>>>
>>>> We just tried out MetaModel with MongoDB and ran a simple join (as in
>>>> select * from t1 join t2 on (t1.id = t2.oid)) between two collections,
>>>> each containing roughly 10,000 documents. Using a developer setup on a
>>>> Mac, we did not get a result, as the system was more or less stuck.
>>>> A quick examination revealed that
>>>> MetaModelHelper.getCarthesianProduct(DataSet[] fromDataSets,
>>>> Iterable<FilterItem> whereItems) consumes most of the resources.
>>>> This implementation first computes the Cartesian product in memory and
>>>> then applies filters on it.
>>>> I wonder what the rationale behind this implementation is, as it will
>>>> not scale well, even for selective joins.
>>>> Or am I using MetaModel wrong here, as in: the join should never be
>>>> computed by getCarthesianProduct()?
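>>>>
>>>> In pseudocode, the current behaviour is roughly this (my reading of
>>>> the code, simplified; allCombinations and matchesAll are hypothetical
>>>> helper names):
>>>>
>>>>     // materialize every row combination first (10,000 x 10,000 here)
>>>>     final List<Row> product = allCombinations(fromDataSets);
>>>>     // ...and only afterwards throw away the rows that don't match
>>>>     product.removeIf(row -> !matchesAll(row, whereItems));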
>>>>
>>>> The problem appears to me to be a general one, so I did not supply a
>>>> code example.
>>>>
>>>> Best,
>>>>
>>>> Jörg
>>>>
>>>>
>>>>
>>>>
>>
>
