On 7/25/07, Britske <[EMAIL PROTECTED]> wrote:
Thanks for the quick reply: sounds good.
Do you know of any example code that would link camel and hibernate
together?
The use of the JPA endpoint is a good start...
http://cwiki.apache.org/CAMEL/jpa.html
(I've just updated the docs hence the strange URL)
basically this component takes any entity bean (a POJO with an @Entity
annotation) and stores it in a JPA provider like Hibernate, OpenJPA or
TopLink.
So if you parsed some file then transformed it into some entity bean
you could then persist it in hibernate by just routing it to a JPA
endpoint.
The JPA endpoint basically assumes that it's given an entity bean in
the message body and persists it; so the endpoint can deal with any
kind of JPA-enabled POJO. You could use multiple endpoints for
different persistence contexts (e.g. different DBs or schemas etc.) if
you need to.
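For example, a route that polls files, transforms each one into an
entity bean, and hands it to the JPA endpoint could look something
like this in the Spring XML configuration (the entity class and the
transformer bean name here are invented for illustration):

```xml
<route>
  <!-- poll files from a directory -->
  <from uri="file:/data/inbound"/>
  <!-- hypothetical bean that parses the file into an @Entity POJO -->
  <to uri="bean:orderTransformer"/>
  <!-- persist the entity bean via the JPA provider (e.g. Hibernate) -->
  <to uri="jpa:com.acme.Order"/>
</route>
```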
Being able to change the target data model while maintaining the
target schema would be a requirement as well (Hibernate can be used to
abstract this; not sure about the performance of bulk updates through
Hibernate though...)
Using batch transactions should really help performance. Nothing ever
comes close to the performance of the raw DB dump tools that the
database vendors provide; but using bulk updates with large
transaction batches should be pretty fast.
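To make the batching point concrete, here's a minimal self-contained
sketch (plain Java, nothing Camel-specific) of chunking rows into
fixed-size batches, so each batch can be committed in one transaction
rather than paying for one commit per row:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDemo {
    // Split work items into fixed-size batches so each batch can be
    // committed as a single transaction instead of one commit per row.
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(new ArrayList<>(
                items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) rows.add(i);
        // 10,000 rows committed in batches of 500 -> only 20 commits
        System.out.println(batches(rows, 500).size()); // prints 20
    }
}
```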
Another thing, just to be certain: leveraging the power of ActiveMQ would
enable the ETL tool to easily scale over multiple servers / processors /
threads, right?
Definitely. You can use the ActiveMQ component to load balance across
consumers on an ActiveMQ queue...
http://activemq.apache.org/camel/activemq.html
or you could use another JMS provider of your choice (though why use
any other provider when ActiveMQ is so good? :)
http://activemq.apache.org/camel/jms.html
Finally, if you want, you could use in-JVM load balancing across a
thread pool (which is fine until you get CPU-bound on a box)
http://activemq.apache.org/camel/seda.html
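As a rough illustration of that in-JVM SEDA idea (just plain
java.util.concurrent, not the actual Camel component): a stage is
essentially a shared queue drained by a pool of consumer threads in
parallel:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SedaDemo {
    // One SEDA-style stage: a shared queue plus a pool of consumers.
    static int run(int messages, int consumers) throws InterruptedException {
        BlockingQueue<String> stage = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        AtomicInteger processed = new AtomicInteger();

        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    // each consumer pulls messages off the shared queue
                    while (!stage.take().equals("STOP")) {
                        processed.incrementAndGet(); // "process" the message
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (int i = 0; i < messages; i++) stage.put("msg-" + i);
        for (int i = 0; i < consumers; i++) stage.put("STOP"); // one per consumer
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(100, 4)); // prints 100
    }
}
```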
I haven't used this stack (ActiveMQ / Camel) before at all,
If you're new to ActiveMQ I'd recommend starting with SEDA to get to
grips with asynchronous SEDA-based processing;
http://activemq.apache.org/camel/seda.html
then move on to distributed SEDA (using JMS queues) later on when
you're feeling more confident.
but the message paradigm seems to be the perfect solution for this. With the
big probability of stating the obvious ;-).
:)
This would mean that pipelines operating in different threads or even
different servers need to be able to handle shared queues, with all the
locking / concurrency stuff, and pipelines on different servers need to be
able to ping each other to go to work. Is this possible as well? Is this
what you are referring to as parallel SEDA queues? I think I have to read up..
Yeah. Whether using the SEDA or a JMS component, each thread is going
to process things concurrently. So you may want to consider ordering
and concurrency issues, and for some things you may need some kind of
concurrency lock. It's a whole large topic in and of itself ;-) - but
the quick 30,000ft view is...
* for distributed locking, try using a database by default; they are
very good at it :)
* to preserve ordering across a JMS queue consider using Exclusive Consumers
http://activemq.apache.org/exclusive-consumer.html
or even better, Message Groups, which let you preserve ordering
across messages while still offering parallelisation, using the
JMSXGroupID header to determine what can be parallelized
http://activemq.apache.org/message-groups.html
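To sketch what Message Groups buy you (again in plain
java.util.concurrent rather than ActiveMQ itself): route every message
carrying the same group id to the same single-threaded worker, so each
group stays in order while distinct groups run in parallel. The group
ids and bodies below are invented for the example:

```java
import java.util.*;
import java.util.concurrent.*;

public class GroupingDemo {
    // Mimics ActiveMQ Message Groups: messages sharing a group id always
    // go to the same single-threaded worker, so order is preserved within
    // a group while different groups can be processed in parallel.
    static Map<String, List<String>> dispatch(List<String[]> messages, int workers)
            throws InterruptedException {
        ExecutorService[] pools = new ExecutorService[workers];
        for (int i = 0; i < workers; i++) pools[i] = Executors.newSingleThreadExecutor();
        Map<String, List<String>> seen = new ConcurrentHashMap<>();

        for (String[] m : messages) {          // m[0] = group id, m[1] = body
            int worker = Math.abs(m[0].hashCode() % workers); // stable group -> worker
            pools[worker].submit(() ->
                seen.computeIfAbsent(m[0], k -> new ArrayList<>()).add(m[1]));
        }
        for (ExecutorService p : pools) {
            p.shutdown();
            p.awaitTermination(10, TimeUnit.SECONDS);
        }
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String[]> msgs = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            msgs.add(new String[]{"orderA", "a" + i});
            msgs.add(new String[]{"orderB", "b" + i});
        }
        // Within each group, arrival order is preserved.
        System.out.println(dispatch(msgs, 3).get("orderA")); // prints [a0, a1, a2, a3, a4]
    }
}
```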
A good rule of thumb to help reduce concurrency problems is to make
sure each single message can be processed as an atomic unit in parallel
(either without concurrency issues or using, say, database locking); or
if it can't, use a Message Group to relate together the messages which
need to be processed in order by a single thread.
Last thing: the ETL tool can get various inputs. One of which is a
web crawler which is scheduled periodically to get some HTML (based on
patterns or whatever). Would/could such a multithreaded crawler in your
opinion be an integral part of the ETL tool, i.e. one of the
'input Camel pipelines'?
Definitely! We should certainly do a web crawler/spider component.
We've got a file crawler so far, but not a web one yet.
pros and cons would be highly appreciated! ;-)
You could certainly use an off-the-shelf spider to create files,
which Camel can already process today. Or you could plug into some Java
spider and, when a new page is hit, send a message into a
Camel endpoint using a CamelTemplate.
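A rough sketch of that hand-off (the CamelTemplate call itself is
elided so the snippet stays self-contained: a Consumer<String> stands
in for "send the body to a Camel endpoint", the page-fetching is
faked, and the endpoint URI mentioned in the comment is invented):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class SpiderHandoff {
    // Walks a list of URLs; for each page "fetched", hands the body to the
    // pipeline entry point. With Camel on the classpath the Consumer would
    // wrap something like template.sendBody("seda:pages", body).
    static int crawl(List<String> urls, Consumer<String> pipeline) {
        int hits = 0;
        for (String url : urls) {
            String body = "<html><!-- fetched from " + url + " --></html>"; // fake fetch
            pipeline.accept(body); // send each new page into the route
            hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        int n = crawl(Arrays.asList("http://example.com/a", "http://example.com/b"),
                      received::add);
        System.out.println(n); // prints 2
    }
}
```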
Ideally though we'd create a web spider component so it'd be really
easy to set up EIP routes using a web spider as input - then we can use
the full power of the Enterprise Integration Patterns and Camel within
the web spider.
this sounds exciting!
Agreed! :)
Thanks in advance,
You're most welcome!
--
James
-------
http://macstrac.blogspot.com/