Hi all,
I just pushed to https://github.com/tomatophantastico/metamodel,
into the accelerateCarthesianProduct branch.
The code is clearly not production ready, but you might want to look at it.
Let me explain:
1. I added a test that computes the join between an employee and a
department table, each with 10,000 tuples.
This test terminates with a heap overflow on my machine.
2. I implemented all the points below, which brought the execution time
to 100 seconds.
Profiling the app, I discovered two things:
3. The BaseObject.hashCode() implementation was now causing a lot of
load, as I was constantly putting/getting rows into/from HashMaps. I
overrode it out of curiosity, which improved performance a lot. After
doing point 2, however, this was no longer necessary.
4. System.arraycopy was also causing a lot of load. To address this, I
changed the visibility of FilterItem.compare() so that only tuples
matching the filter items are copied.
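To illustrate point 4, here is a minimal self-contained sketch (not MetaModel code; the tuple representation and filter signature are made up for illustration) of the difference between materializing every pair and evaluating the filter before a tuple is copied:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class EarlyFilterJoin {

    // Naive variant: materialize the full Cartesian product, then filter.
    // For two 10,000-row inputs this allocates 100,000,000 pairs up front.
    static List<int[]> productThenFilter(int[] left, int[] right,
                                         BiPredicate<Integer, Integer> filter) {
        List<int[]> product = new ArrayList<>();
        for (int l : left) {
            for (int r : right) {
                product.add(new int[] { l, r }); // every pair is copied
            }
        }
        product.removeIf(pair -> !filter.test(pair[0], pair[1]));
        return product;
    }

    // Early-filter variant: evaluate the predicate first, so only tuples
    // matching the filter items are ever copied.
    static List<int[]> filterWhileJoining(int[] left, int[] right,
                                          BiPredicate<Integer, Integer> filter) {
        List<int[]> result = new ArrayList<>();
        for (int l : left) {
            for (int r : right) {
                if (filter.test(l, r)) {
                    result.add(new int[] { l, r });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] left = { 1, 2, 3 };
        int[] right = { 2, 3, 4 };
        BiPredicate<Integer, Integer> eq = Integer::equals;
        // Both variants return the same two matches, (2,2) and (3,3), but
        // the second never holds the other seven non-matching pairs in memory.
        System.out.println(filterWhileJoining(left, right, eq).size()); // 2
    }
}
```

Both variants are quadratic in time, but the early-filter variant only allocates the rows that survive the filter, which is what cuts the arraycopy load.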
BTW, where do you usually have these discussions? JIRA, GitHub, or
this mailing list?
Best,
Jörg
On 2016-05-03 at 12:06, Jörg Unbehauen wrote:
Hi Kasper,
I'd happily contribute; actually, I already started working on it, but
soon discovered that there might be a lot of side effects.
So the basic ideas I had were:
1. Start with in memory datasets
2. Stream non-rewindable datasets
3. Directly apply filters to every row created.
4. Join datasets with filters first, in order to prevent large
Cartesian products.
And further:
5. When joining, avoid nested loops; instead, build indexes for the
in-memory datasets. Not sure about this one, though.
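The indexing idea in point 5 could be sketched like this (plain Java, not MetaModel's DataSet API; the Employee/Department records are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinSketch {

    record Employee(String name, int deptId) {}
    record Department(int id, String name) {}

    // Build a hash index over the in-memory department dataset once, then
    // probe it per employee row: O(n + m) work instead of the O(n * m)
    // nested loop a Cartesian product implies.
    static List<String> hashJoin(List<Employee> employees,
                                 List<Department> departments) {
        Map<Integer, Department> index = new HashMap<>();
        for (Department d : departments) {
            index.put(d.id(), d); // one pass to build the index
        }
        List<String> joined = new ArrayList<>();
        for (Employee e : employees) {
            Department d = index.get(e.deptId()); // constant-time probe
            if (d != null) {
                joined.add(e.name() + " works in " + d.name());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Employee> emps = List.of(
                new Employee("Ada", 1), new Employee("Bob", 2));
        List<Department> depts = List.of(new Department(1, "R&D"));
        System.out.println(hashJoin(emps, depts)); // [Ada works in R&D]
    }
}
```

For a real implementation the index would need to handle duplicate join keys (a multimap) and composite keys, but the shape of the optimization is the same.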
Feedback is appreciated.
I have already started a fork
(https://github.com/tomatophantastico/metamodel); as soon as it
works, I'll write an email again, before opening any pull requests.
Best,
Jörg
On 2016-05-02 at 06:56, Kasper Sørensen wrote:
Hi Jörg,
You're right about the very naive behaviour of that method. It could
_certainly_ use an optimization or two. I can only speak for myself,
but I
just never used MetaModel much for joins and thus never gave it much
thought. Looking at the code I'm thinking that we can do much better.
Would you be interested in working on improving this? If so, I
will happily share insights and ideas on how we can pull it off.
Cheers,
Kasper
On 2016-05-01 at 4:12 GMT-07:00, Jörg Unbehauen <
[email protected]> wrote:
Hi all,
we just tried out MetaModel with MongoDB, running a simple join (as
in select * from t1 join t2 on (t1.id = t2.oid) ) between two
collections, each containing roughly 10,000 documents. Using a
developer setup on a Mac, we did not get a result, as the system was
more or less stuck.
A quick examination revealed that
MetaModelHelper.getCarthesianProduct(DataSet[] fromDataSets,
Iterable<FilterItem> whereItems) consumes most of the resources.
This implementation first computes the Cartesian product in memory and
then applies the filters to it.
I wonder what the rationale behind this implementation is, as it will
not scale well, even for selective joins.
Or am I using MetaModel wrong here, as in: the join should never be
computed by getCarthesianProduct()?
The problem appears to me to be a general one, so I did not supply a
code example.
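The scale of the problem is simple arithmetic: materializing the product first means holding left x right intermediate rows, no matter how selective the join is. A quick sketch with the two 10,000-document collections from above:

```java
public class CartesianBlowup {
    public static void main(String[] args) {
        long left = 10_000;
        long right = 10_000;
        // Rows materialized in memory before any filter is applied:
        long intermediate = left * right;
        System.out.println(intermediate); // 100000000
        // Upper bound on what a 1:1 equi-join would actually return:
        System.out.println(Math.min(left, right)); // 10000
    }
}
```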
Best,
Jörg