Hi all,

I just pushed to https://github.com/tomatophantastico/metamodel , into the accelerateCarthesianProduct branch.

The code is clearly not production ready, but you might want to look at it.

Let me explain:

1. I added a test that calculates the join between an employee and a department table, each with 10,000 tuples.

This test terminates with a heap overflow on my machine.
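
Just to make the scale explicit (plain back-of-the-envelope Java, not the test from the branch; the bytes-per-row figure is a made-up guess): materializing the full cartesian product means holding every combination in memory before a single filter is evaluated.

    public class CartesianScale {
        public static void main(String[] args) {
            long employees = 10_000;
            long departments = 10_000;
            long combinations = employees * departments;   // 100,000,000 combined rows
            long assumedBytesPerRow = 100;                  // hypothetical, rather optimistic estimate
            System.out.printf("%,d combined rows, roughly %,d MB before any filtering%n",
                    combinations, combinations * assumedBytesPerRow / (1024 * 1024));
        }
    }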

2. I implemented all the points below (from my earlier mail), which brought the execution time to 100 seconds.

Profiling the app, I discovered two things:

3. The BaseObject.hashCode() implementation was now causing a lot of load, as I was constantly putting objects into and getting them out of hash maps. I overrode it just out of curiosity, which improved performance a lot. After doing point 2, however, this was no longer necessary.
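
Just to illustrate the kind of override I experimented with (a hypothetical sketch, not the actual BaseObject change; RowKey is an invented name): cache the hash code of an immutable key so that repeated HashMap puts/gets do not recompute it.

    import java.util.Arrays;

    // Hypothetical sketch: an immutable row-like key that computes its hash code once.
    final class RowKey {
        private final Object[] values;
        private final int cachedHash; // computed once in the constructor, reused on every map access

        RowKey(Object[] values) {
            this.values = values.clone();
            this.cachedHash = Arrays.deepHashCode(this.values);
        }

        @Override
        public int hashCode() {
            return cachedHash;
        }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof RowKey && Arrays.deepEquals(values, ((RowKey) obj).values);
        }
    }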

4. The System.arraycopy calls were also causing a lot of load, so I changed the visibility of FilterItem.compare() in order to copy only those tuples that match the filter items.
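
In plain Java (not the actual MetaModel code; a generic BiPredicate stands in for the FilterItem evaluation), the idea is simply to test the join condition on the two source rows first and only do the arraycopy for combinations that survive:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.BiPredicate;

    public class FilterBeforeCopySketch {
        // Sketch: evaluate the join condition before materializing the combined row,
        // so System.arraycopy only runs for matching tuples.
        static List<Object[]> joinFiltered(List<Object[]> left, List<Object[]> right,
                                           BiPredicate<Object[], Object[]> condition) {
            List<Object[]> result = new ArrayList<>();
            for (Object[] l : left) {
                for (Object[] r : right) {
                    if (!condition.test(l, r)) {
                        continue; // rejected before any copying happens
                    }
                    Object[] combined = new Object[l.length + r.length];
                    System.arraycopy(l, 0, combined, 0, l.length);
                    System.arraycopy(r, 0, combined, l.length, r.length);
                    result.add(combined);
                }
            }
            return result;
        }
    }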

BTW, where do you usually have these discussions? JIRA, GitHub, or this mailing list?

Best,

Jörg





On 03.05.16 at 12:06, Jörg Unbehauen wrote:
Hi Kasper,

I'd happily contribute; actually, I already started working on it, but soon discovered that there might be a lot of side effects.

So the basic ideas I had were:

1. Start with in-memory datasets.

2. Stream non-rewindable datasets

3. Directly apply filters to every row created.

4. Join datasets with filters first, in order to prevent cartesian products.

And further:

5. When joining, avoid nested loops and instead build indexes for the in-memory datasets. Not sure about this one, though (a rough sketch of what I mean follows below).
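
To make point 5 a bit more concrete, here is a rough sketch in plain Java rather than MetaModel types (HashJoinSketch and its method are made up for illustration): build a hash index on the join key of one in-memory dataset and probe it while iterating over the other, instead of looping over every combination.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class HashJoinSketch {
        // Sketch of an equi-join via a hash index instead of a nested loop.
        static List<Object[]> hashJoin(List<Object[]> build, int buildKeyIdx,
                                       List<Object[]> probe, int probeKeyIdx) {
            // build phase: key -> all rows of the build side with that key
            Map<Object, List<Object[]>> index = new HashMap<>();
            for (Object[] row : build) {
                index.computeIfAbsent(row[buildKeyIdx], k -> new ArrayList<>()).add(row);
            }
            // probe phase: only matching combinations are ever materialized
            List<Object[]> result = new ArrayList<>();
            for (Object[] probeRow : probe) {
                List<Object[]> matches = index.get(probeRow[probeKeyIdx]);
                if (matches == null) {
                    continue;
                }
                for (Object[] buildRow : matches) {
                    Object[] combined = new Object[buildRow.length + probeRow.length];
                    System.arraycopy(buildRow, 0, combined, 0, buildRow.length);
                    System.arraycopy(probeRow, 0, combined, buildRow.length, probeRow.length);
                    result.add(combined);
                }
            }
            return result;
        }
    }

The build side should be the smaller of the two in-memory datasets so the index stays small; the probe side could even be a streamed, non-rewindable dataset, which would tie in nicely with point 2.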

Feedback is appreciated.

I already started a fork (https://github.com/tomatophantastico/metamodel); as soon as it works, I'll write an email again before opening any pull requests.

Best,

Jörg


On 02.05.16 at 06:56, Kasper Sørensen wrote:
Hi Jörg,

You're right about the very naive behaviour of that method. It could
_certainly_ use an optimization or two. I can only speak for myself, but I
just never used MetaModel much for joins and thus never gave it much
thought. Looking at the code I'm thinking that we can do much better.

Would you be interested in working on improving this condition? If so I
will happily share insights and ideas on how we can pull it off.

Cheers,
Kasper

2016-05-01 4:12 GMT-07:00 Jörg Unbehauen <[email protected]>:

Hi all,

We just tried out MetaModel with MongoDB and ran a simple join (as in select * from t1 join t2 on (t1.id = t2.oid)) between two collections, each containing roughly 10,000 documents. Using a developer setup on a Mac,
we did not get a result, as the system was more or less stuck.
A quick examination revealed that
MetaModelHelper.getCarthesianProduct(DataSet[] fromDataSets,
Iterable<FilterItem> whereItems) consumes most of the resources.
This implementation first computes the cartesian product in memory and
then applies the filters to it.
I wonder what the rationale behind this implementation is, as it will not
scale well, even for selective joins.
Or am I using MetaModel wrong here, as in: the join should never be
computed by getCarthesianProduct()?

The problem appears to me to be a general one, so I did not supply a
code example of our setup.
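
To illustrate the general pattern I mean, though, here is a plain-Java sketch (not the actual MetaModelHelper code; a generic Predicate stands in for the where items):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    public class MaterializeThenFilterSketch {
        // Sketch of the problematic pattern: every combination is built and stored
        // first; the where items are only applied afterwards.
        static List<Object[]> materializeThenFilter(List<Object[]> left, List<Object[]> right,
                                                    Predicate<Object[]> whereItems) {
            List<Object[]> product = new ArrayList<>();
            for (Object[] l : left) {
                for (Object[] r : right) {
                    Object[] combined = new Object[l.length + r.length];
                    System.arraycopy(l, 0, combined, 0, l.length);
                    System.arraycopy(r, 0, combined, l.length, r.length);
                    product.add(combined); // with 10,000 x 10,000 inputs this holds 100 million rows
                }
            }
            product.removeIf(whereItems.negate()); // filtering only happens after the blow-up
            return product;
        }
    }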

Best,

Jörg




