Hello, the June meeting of the Prague Czech Java User Group will take place on June 26 at 7 p.m. in lecture hall S5 at the Faculty of Mathematics and Physics of Charles University, Malostranské náměstí 25, Prague 1. Two talks await us: Engineering Machine Learning Algorithms at Scale (Prof. Jan Vitek) and Real-time stream data processing (Zbyněk Šlajchrt). This meeting is sponsored by AVAST Software. Admission to CZJUG events is free and no advance registration is needed. If you plan to come, let us know by voting in the poll on the front page of the java.cz portal.
Engineering Machine Learning Algorithms at Scale

The talk will describe how to engineer a scalable implementation of a popular supervised machine learning algorithm, Random Forest, so that it can scale to terabyte data sets. To achieve this, I will describe how to leverage the H2O analytics engine to write distributed Java Fork/Join code that is massively scalable and efficient. H2O has an API for Big Data math that uses a simple giant-vector programming style running in parallel across a cluster. H2O can run on top of infrastructures such as Hadoop, or standalone, and has been shown to scale to hundreds of nodes. The H2O project is an open-source effort, and so is our implementation of Random Forest.

Jan Vitek

Jan Vitek is a Professor of Computer Science at Purdue University, USA. His research career encompasses work on all aspects of programming language design and implementation. He led the development of the first real-time Java virtual machine, and he has worked on language-based security, concurrency, and transactional memory. On the academic side, he chairs SIGPLAN, the ACM Special Interest Group on Programming Languages, and has chaired conferences such as ECOOP, PLDI, COORDINATION, and TOOLS. He was an academic visitor for several years at IBM and Oracle. He co-founded Fiji Systems to sell real-time technology and is currently an advisor at 0xdata, where he works on big data. His most recent research interests include JavaScript and the R programming language.

Real-time stream data processing

This presentation deals with the concept of coroutines and their applicability to stream data processing. Although rarely used in today's applications, coroutines have been around since the early days of digital computing. Surprisingly, coroutines can be combined nicely with the map-reduce paradigm that is used so frequently in cloud computing and big data processing.
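The combination described above can be sketched in plain Java. The class and method names below are hypothetical illustrations for this announcement, not the Clockwork API: a map-reduce-style word count applied to a stream of events one at a time, so the running result is available after every event rather than only after an offline batch finishes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the Clockwork API): map and reduce steps
// applied incrementally to a stream of incoming events.
public class StreamWordCount {

    // Running reduce state: word -> count, updated as events arrive.
    private final Map<String, Integer> counts = new HashMap<>();

    // Process a single incoming event (one line of text).
    public void onEvent(String line) {
        // map step: split the event into (word, 1) pairs
        for (String word : line.split("\\s+")) {
            // reduce step: fold each pair into the running state
            counts.merge(word, 1, Integer::sum);
        }
    }

    // Current result, valid at any point in the stream -- this is what
    // makes the approach "real-time" rather than batch-oriented.
    public Map<String, Integer> snapshot() {
        return Map.copyOf(counts);
    }

    public static void main(String[] args) {
        StreamWordCount job = new StreamWordCount();
        for (String event : List.of("fast data", "big data", "fast stream")) {
            job.onEvent(event); // events are consumed as they arrive
        }
        System.out.println(job.snapshot());
    }
}
```

The design point is that the reduce state is folded into on every event, so a query can be answered mid-stream, whereas a traditional Hadoop job would only produce counts once the whole input has been processed.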
In contrast to the traditional map-reduce concept, which is designed for offline job processing, the coroutines-and-map-reduce hybrid is primarily targeted at real-time event processing. Clockwork, an open-source library developed at Avast, combines these two concepts and allows a programmer to write a real-time stream analysis as if they were writing a traditional map-reduce job for, say, Hadoop. The presentation focuses mainly on coding and samples and will show how to program applications ranging from simple real-time statistics to more advanced tasks.

Zbyněk Šlajchrt

After finishing his studies at the Faculty of Mathematics and Physics of Charles University, he began working as a Java EE developer and architect at several Czech and international companies. He now works at Avast a.s. and, alongside his main job, lectures on Java EE programming at the University of Economics, Prague. In his current position at AVAST Software he is responsible for designing and developing a private cloud platform and applications built on the Java platform.

--
Best regards,
Roman "Dagi" Pichlik

/* http://dagblog.cz/ A blog for coders */