[kamaelia-list] Distributed data processing

Michael Sparks Sat, 11 Jul 2009 12:20:50 -0700

On Saturday 11 July 2009 01:15:08 manimal45 wrote:
> I read about Kamaelia investigation to be used for data modeling.
> I'm working on data migration projects where we have to handle tons of
> data, with performance issues as every row and field of an
> entire database has to be read and transformed.
> I believe it's much the same issue one faces when dealing with
> massive medical data for research subjects.


Quite probably. Kamaelia should be well suited to this sort of task. At its
core, Kamaelia is really a data flow system, and many of the earliest data
processing systems were inherently data flow. The best way to find out, of
course, is to try it, so I'm interested to hear that you are doing just
that! :)

> I already posted a couple of questions some months ago when I first
> discovered Kamaelia (which i find to be so great, and I've converted
> some java friends to python and kamaelia).

Cool - that's really nice to hear. It's always nice to hear from anyone using 
Kamaelia. Incidentally, since you posted to the list last time, I've created 
and given a tutorial on kamaelia, which you can find here:

    * http://www.kamaelia.org/PragmaticConcurrency

It includes a number of step-by-step examples aimed at understanding the core,
building components, and building larger systems. The PDF is linked near the
bottom of that page.

> I came up with the idea to use Kamaelia to distribute data and queries
> across some nodes.

That explains the questions :-)

> Summer student worked with me and we've managed to have something
> running with sqlite nodes.

I'd be very interested to see this, if you're able to share. If you can't, no 
problem :-)

> We have some naive concepts:
> 1) data are distributed physically across nodes, without any key/range
> partitioning
> 2) users send queries to proxies
> 3) proxies redirect queries to nodes
> 4) nodes request missing data from across the system (missing data can
> arise when joining tables together, and we've defined some tags to
> define parent and child tables); this is the most complex part of the
> system:
>     i) Bloom filters are computed on the column sets of a join
>     ii) the Bloom filters are sent to all other nodes
>     iii) matching data are computed on each node and sent back to the
> requesting node
>     iv) Bloom filters are stored so that the filter for a column set is
> never computed twice

This sounds like an optimised form of map-reduce. I like the use of Bloom
filters in this scenario (I've come across them before): if I understand what
you're doing correctly, you do not need to store a central index, and your
indexing algorithm does not depend on the number of nodes.

That would suggest that you can just add more nodes with data to your system 
without any reindexing, etc.
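
To make sure I've understood the join step, here is a minimal sketch of how I
imagine it, in plain Python. It is not your code and not part of Kamaelia: the
BloomFilter class, build_filter, matching_rows, the filter size and the use of
salted MD5 hashes are all assumptions of mine for illustration. The requesting
node builds a filter over its join column values, ships the filter, and each
remote node returns only the rows that might match.

```python
import hashlib


class BloomFilter(object):
    """Tiny Bloom filter: never a false negative, sometimes a false positive."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python integer used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            data = ("%d:%r" % (salt, key)).encode("utf-8")
            yield int(hashlib.md5(data).hexdigest(), 16) % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))


def build_filter(rows, columns):
    """Requesting node: summarise the join column values it already holds."""
    bloom = BloomFilter()
    for row in rows:
        bloom.add(tuple(row[c] for c in columns))
    return bloom


def matching_rows(rows, columns, bloom):
    """Remote node: keep only the rows whose join key may be in the filter."""
    return [row for row in rows if tuple(row[c] for c in columns) in bloom]


# Hypothetical data: orders live on one node, customers on another.
orders = [{"customer_id": 1}, {"customer_id": 7}]
customers = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"},
             {"id": 7, "name": "c"}]

bloom = build_filter(orders, ["customer_id"])
candidates = matching_rows(customers, ["id"], bloom)
```

Because a Bloom filter can give false positives, the requesting node would
still have to run the real join on the candidate rows it gets back. The filter
only cuts down what travels over the network.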

> We do not use JSON to send messages across the network; rather, we just
> cPickle native Python dictionaries.
> I don't know if it's the best idea I had, but it's rather simple.

I'm a great believer in picking something simple and moving forward with that.
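
For reference, a minimal sketch of that kind of round trip. The function names
and the example message are hypothetical; only the pickle calls themselves are
standard library. cPickle is the Python 2 name, and on Python 3 the plain
pickle module does the same job.

```python
try:
    import cPickle as pickle  # Python 2: the C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically


def encode_message(message):
    """Turn a native dictionary into bytes ready to send over a socket."""
    return pickle.dumps(message, protocol=2)


def decode_message(payload):
    """Rebuild the dictionary on the receiving node."""
    return pickle.loads(payload)


# Hypothetical query message passed from a proxy to a node.
query = {"type": "query", "sql": "SELECT * FROM person", "reply_to": "node-3"}
payload = encode_message(query)
received = decode_message(payload)
```

One thing to keep in mind: unpickling data can execute arbitrary code, so this
is only reasonable between nodes that trust each other.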

> This system works quite well as we deal with dead data (no writes nor
> update on source data, just writes on target tables which we create on
> the fly).
> There's a strong overhead on first queries as bloom are computed and
> data are sent and received.
> But after first N queries, data are "self balancing", and there's no
> more data transport so that we can scale up to full parallelism.

This makes a lot of sense. If you want an example of something in a different
domain that uses a similar technique, this also happens in clusters of Squid
HTTP web caches (they call their Bloom filters cache digests).

> The big issue here is sqlite :
>     1) no type support
>     2) no transaction support (==> single user system)
>     3) some obscure bugs (ex: sqlite sometimes raises an Exception
> named "NotAnError", which is very hard to understand, don't you
> think? ;)).

I've not used SQLite, so I probably can't speak to that. That *is* a rather
odd error though :)

> As Python clearly lacks a good database API, we cannot move easily to
> another database.
> Dealing with Oracle, MySQL, Postgres from Python is not so easy, and
> DB API is never implemented the same way(!), which makes switching
> from one RDBMS to another a pain.

This is something that has always surprised me about Python. In the Perl world
there is one obvious way to do databases, using DBI with its DBD drivers, even
though Perl's motto is "there's more than one way to do it". Whereas in Python
there are multiple competing DB APIs, despite the motto "there should be one
obvious way to do it".
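
One concrete place the differences show up is parameter markers. The DB-API
2.0 specification (PEP 249) lets each driver module pick its own paramstyle,
so the same INSERT has to be spelt differently per driver. Here is a minimal
sketch of papering over that for the positional styles only. The insert_sql
helper and the person table are hypothetical, and sqlite3 is used simply
because it ships with Python. A driver with a "named" or "pyformat" paramstyle
would need extra handling that this sketch does not attempt.

```python
import sqlite3

# Positional paramstyles defined by PEP 249, mapped to a marker generator.
_POSITIONAL = {
    "qmark": lambda i: "?",            # e.g. sqlite3
    "numeric": lambda i: ":%d" % i,
    "format": lambda i: "%s",
}


def insert_sql(module, table, columns):
    """Build an INSERT statement using the driver module's own paramstyle.

    Raises KeyError for the non-positional styles ("named", "pyformat").
    """
    make_marker = _POSITIONAL[module.paramstyle]
    markers = ", ".join(make_marker(i) for i in range(1, len(columns) + 1))
    return "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(columns), markers)


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE person (name TEXT, age INTEGER)")

sql = insert_sql(sqlite3, "person", ["name", "age"])
cur.execute(sql, ("Ada", 36))
conn.commit()

cur.execute("SELECT name, age FROM person")
rows = cur.fetchall()
conn.close()
```

Table and column names are pasted straight into the SQL text here, so they
must come from your own code and never from user input.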

> JDBC use would be so nice.

I can understand that motivation.

> So we tried to use Kamaelia within Jython.
> We've managed to import Axon + Kamaelia modules "as is", and it
> surprisingly worked ... at some level.
> I can state a 99% Axon compatibility with Jython.

This is interesting, and nice to hear. If you have any patches or fixes for
that final 1%, please post them to the list (patches are preferable to
descriptions :-). Jython does not suffer from a GIL, and as a result offers
some interesting possibilities.

> Unfortunately, no TCPClient nor SimpleServer would run  with Jython.
> There's some CSA error being raised.

Again, bug reports are always welcome, preferably with minimal example code
and stack traces exhibiting the problem.

> Sadly, we've come to the conclusion that JDBC will not be running this
> summer. Does anyone know of a quick fix for it? (Should I post
> this question on the Jython page?)

Since I've spent a lot of my time avoiding Java over the years, asking in a 
Jython forum would make sense.

> Kamaelia + Jython would clearly rock for the many practical cases
> where Java  (which I hate) stands as a standard  (JDBC, Swing ...) but
> lacks some aspects (like concurrency, message passing, and will to
> have a simpler life; do you see what i'm talking about?).

I absolutely understand :-)

> I'll get my student to write some slides about our work and share it
> with you.

This would be extremely interesting. I'd really like to have pages on
kamaelia.org describing successful use cases of Kamaelia, since they explain
how Kamaelia can be useful far better than anything else does.

It also provides a means for people to share solutions more widely.

> Any comments or feedback welcome!

I'd like to echo the two points John made:
   * Please look into SQLAlchemy - I've heard some nice things about it, and
      there was a good article in the Python magazine recently. I get the
      impression that many people are beginning to think of it as one of the
      more Pythonic database libraries.

   * Please continue to let us know what you're doing and what you're up to.
      Knowing that people are actively using Kamaelia is a great motivation
      for sharing our updates more regularly. It's a primary motivator, for
      example, for moving to monthly interim releases.

If you want to (and are able to!) share your code more widely, I'm more than 
happy to provide you with a friendly environment for doing so.

In the meantime I'm glad to hear you're finding Kamaelia useful :-)

Best Regards,


Michael.
-- 
http://yeoldeclue.com/blog
http://twitter.com/kamaelian
http://www.kamaelia.org/Home

