[jira] [Commented] (SAMZA-423) Integrate Lucene into Samza

Martin Kleppmann (JIRA) Mon, 10 Nov 2014 11:47:27 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205218#comment-14205218
 ]


Martin Kleppmann commented on SAMZA-423:
----------------------------------------

Today [~romseygeek] and I hacked together a proof-of-concept stream searching 
system using [Luwak|https://github.com/flaxsearch/luwak] (a Lucene wrapper with 
optimisations for matching individual documents against a large set of stored 
queries) and Samza: https://github.com/romseygeek/samza-luwak

This experiment focuses on matching incoming messages against a set of queries 
(the second mode in the issue description), not on building an index of 
documents. It seems to work well and the code is reasonably simple, but it is 
still far from production-ready.

The biggest issue we ran into was that we wanted queries to be partitioned, but 
documents to be broadcast to all partitions. I've added a [comment about 
this|https://issues.apache.org/jira/browse/SAMZA-353?focusedCommentId=14205216&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14205216]
 on SAMZA-353.
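
To make the partitioned-queries/broadcast-documents layout concrete, here is a minimal, self-contained sketch. It is not the samza-luwak code and uses neither the Samza nor the Luwak API: the class name {{PartitionedQueryMatcher}} is hypothetical, the partitions are plain in-memory maps, assigning queries by hashing the query id is an assumption, and a case-insensitive substring test stands in for real Lucene query evaluation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: simulates partitions in a single process.
public class PartitionedQueryMatcher {
    // One map of queryId -> search term per simulated partition.
    private final List<Map<String, String>> partitions = new ArrayList<>();

    public PartitionedQueryMatcher(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Queries are partitioned: each query lives in exactly one partition.
    // Hashing the query id is an assumed assignment strategy.
    public int registerQuery(String queryId, String term) {
        int partition = Math.floorMod(queryId.hashCode(), partitions.size());
        partitions.get(partition).put(queryId, term.toLowerCase(Locale.ROOT));
        return partition;
    }

    // Documents are broadcast: every partition sees the document and
    // checks it against only the queries that partition holds.
    public Set<String> matchDocument(String text) {
        String normalized = text.toLowerCase(Locale.ROOT);
        Set<String> matches = new TreeSet<>();
        for (Map<String, String> partition : partitions) {
            for (Map.Entry<String, String> query : partition.entrySet()) {
                // Substring test is a stand-in for evaluating a real query.
                if (normalized.contains(query.getValue())) {
                    matches.add(query.getKey());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        PartitionedQueryMatcher matcher = new PartitionedQueryMatcher(4);
        matcher.registerQuery("q-samza", "samza");
        matcher.registerQuery("q-lucene", "lucene");
        matcher.registerQuery("q-kafka", "kafka");
        // prints [q-lucene, q-samza]
        System.out.println(matcher.matchDocument("Integrating Lucene into Samza"));
    }
}
```

In a real job each partition would be a separate task, so the per-partition loop in {{matchDocument}} would run in parallel and the matches would have to be merged downstream.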

> Integrate Lucene into Samza
> ---------------------------
>
>                 Key: SAMZA-423
>                 URL: https://issues.apache.org/jira/browse/SAMZA-423
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Martin Kleppmann
>              Labels: project
>
> At the moment Samza only has a key-value storage engine 
> (LevelDB/RocksDB/in-memory), but Samza's state abstraction is designed to be 
> more general than that. In particular, we've discussed integrating Lucene in 
> order to support full-text indexes.
> There are two modes of using Lucene which would make sense in a stream 
> processing system:
> * Treat incoming messages as documents to be added to an index. This would be 
> akin to the key-value storage model, but with much richer indexing 
> capabilities. It would enable joining streams not just on a single key, but 
> on more complex criteria (arbitrary boolean expressions for joins), and 
> tasks such as deduplicating similar documents (which you may want, e.g., 
> when implementing a web crawler).
> * Match incoming messages against a set of queries, where the queries are 
> more or less fixed (perhaps the queries are updated via another input 
> stream). In this case, the message isn't added to a persistent index, but 
> it's only analysed and matched against a query as it flows through the stream 
> processor. This is useful for monitoring an activity stream for events of 
> interest ("notify me whenever a news article mentions my company name"). It's 
> perhaps comparable to ElasticSearch's 
> [percolator|http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html].
> Yesterday I was chatting to [~romseygeek], a committer on Lucene/Solr. We 
> have a vague plan to hack on a proof of concept to see what an integration of 
> Lucene and Samza could look like. This is a placeholder ticket for collecting 
> anything related to that experiment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
