Hi all,
I just pushed to https://github.com/tomatophantastico/metamodel,
into the accelerateCarthesianProduct branch.
The code is clearly not production ready, but you might want to look at it.
Let me explain:
1. I added a test that computes the join between an employee and a
department table, each with 10,000 tuples.
This test terminates with a heap overflow on my machine.
2. I implemented all the points below, which brought the execution time
to 100 seconds.
Profiling the app, I discovered two things:
3. The BaseObject.hashCode() implementation was now causing a lot of
load, as I was constantly putting/getting rows into/from HashMaps. I
overrode it out of curiosity, which improved performance a lot. After
doing point 2, however, this was no longer necessary.
4. System.arraycopy was also causing a lot of load. To address this, I
changed the visibility of FilterItem.compare() so that only tuples
matching the filter items are copied.
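To illustrate point 4, here is a minimal self-contained sketch (not MetaModel code; the tuple representation and filter signature are made up for illustration) of the difference between materializing every pair and evaluating the filter before a tuple is copied:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class EarlyFilterJoin {

    // Naive variant: materialize the full Cartesian product, then filter.
    // For two 10,000-row inputs this allocates 100,000,000 pairs up front.
    static List<int[]> productThenFilter(int[] left, int[] right,
                                         BiPredicate<Integer, Integer> filter) {
        List<int[]> product = new ArrayList<>();
        for (int l : left) {
            for (int r : right) {
                product.add(new int[] { l, r }); // every pair is copied
            }
        }
        product.removeIf(pair -> !filter.test(pair[0], pair[1]));
        return product;
    }

    // Early-filter variant: evaluate the predicate first, so only tuples
    // matching the filter items are ever copied.
    static List<int[]> filterWhileJoining(int[] left, int[] right,
                                          BiPredicate<Integer, Integer> filter) {
        List<int[]> result = new ArrayList<>();
        for (int l : left) {
            for (int r : right) {
                if (filter.test(l, r)) {
                    result.add(new int[] { l, r });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] left = { 1, 2, 3 };
        int[] right = { 2, 3, 4 };
        BiPredicate<Integer, Integer> eq = Integer::equals;
        // Both variants return the same two matches, (2,2) and (3,3), but
        // the second never holds the other seven non-matching pairs in memory.
        System.out.println(filterWhileJoining(left, right, eq).size()); // 2
    }
}
```

Both variants are quadratic in time, but the early-filter variant only allocates the rows that survive the filter, which is what cuts the arraycopy load.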
BTW, where do you usually have these discussions? JIRA, GitHub, or
this mailing list?
Best,
Jörg
On 2016-05-03 at 12:06, Jörg Unbehauen wrote:
Hi Kasper,
I'd happily contribute; actually, I already started working on it, but
soon discovered that there might be a lot of side effects.
So the basic ideas I had were:
1. Start with in memory datasets
2. Stream non-rewindable datasets
3. Directly apply filters to every row created.
4. Join datasets with filters first, in order to prevent large
Cartesian products.
And further:
5. When joining, avoid nested loops; instead, build indexes for the
in-memory datasets. Not sure about this one, though.
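The indexing idea in point 5 could be sketched like this (plain Java, not MetaModel's DataSet API; the Employee/Department records are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinSketch {

    record Employee(String name, int deptId) {}
    record Department(int id, String name) {}

    // Build a hash index over the in-memory department dataset once, then
    // probe it per employee row: O(n + m) work instead of the O(n * m)
    // nested loop a Cartesian product implies.
    static List<String> hashJoin(List<Employee> employees,
                                 List<Department> departments) {
        Map<Integer, Department> index = new HashMap<>();
        for (Department d : departments) {
            index.put(d.id(), d); // one pass to build the index
        }
        List<String> joined = new ArrayList<>();
        for (Employee e : employees) {
            Department d = index.get(e.deptId()); // constant-time probe
            if (d != null) {
                joined.add(e.name() + " works in " + d.name());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Employee> emps = List.of(
                new Employee("Ada", 1), new Employee("Bob", 2));
        List<Department> depts = List.of(new Department(1, "R&D"));
        System.out.println(hashJoin(emps, depts)); // [Ada works in R&D]
    }
}
```

For a real implementation the index would need to handle duplicate join keys (a multimap) and composite keys, but the shape of the optimization is the same.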
Feedback is appreciated.
I have already started a fork
(https://github.com/tomatophantastico/metamodel); as soon as it
works, I'll write an email again, before opening any pull requests.
Best,
Jörg
On 2016-05-02 at 06:56, Kasper Sørensen wrote:
Hi Jörg,
You're right about the very naive behaviour of that method. It could
_certainly_ use an optimization or two. I can only speak for myself,
but I
just never used MetaModel much for joins and thus never gave it much
thought. Looking at the code I'm thinking that we can do much better.
Would you be interested in working on improving this? If so, I
will happily share insights and ideas on how we can pull it off.
Cheers,
Kasper
On 2016-05-01 at 4:12 GMT-07:00, Jörg Unbehauen <
[email protected]> wrote:
Hi all,
we just tried out MetaModel with MongoDB, running a simple join (as
in select * from t1 join t2 on (t1.id = t2.oid) ) between two
collections, each containing roughly 10,000 documents. Using a
developer setup on a Mac, we did not get a result, as the system was
more or less stuck.
A quick examination revealed that
MetaModelHelper.getCarthesianProduct(DataSet[] fromDataSets,
Iterable<FilterItem> whereItems) consumes most of the resources.
This implementation first computes the Cartesian product in memory and
then applies the filters to it.
I wonder what the rationale behind this implementation is, as it will
not scale well, even for selective joins.
Or am I using MetaModel wrong here, as in: the join should never be
computed by getCarthesianProduct()?
The problem appears to me to be a general one, so I did not supply a
code example.
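The scale of the problem is simple arithmetic: materializing the product first means holding left x right intermediate rows, no matter how selective the join is. A quick sketch with the two 10,000-document collections from above:

```java
public class CartesianBlowup {
    public static void main(String[] args) {
        long left = 10_000;
        long right = 10_000;
        // Rows materialized in memory before any filter is applied:
        long intermediate = left * right;
        System.out.println(intermediate); // 100000000
        // Upper bound on what a 1:1 equi-join would actually return:
        System.out.println(Math.min(left, right)); // 10000
    }
}
```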
Best,
Jörg