Re: Real time vs On demand cluster

Dawid Weiss Fri, 07 Mar 2014 00:00:04 -0800

> But as I said, you don't cluster a document, you might want to recheck your 
> terminology :)


The terminology is fine. The same word applies to two different things
here, hence the confusion: clustering as an infrastructure arrangement
(a cluster of nodes) and clustering as in statistical data analysis
(or text analysis).

> By clustering I mean this: when I submit a document to ES, indexing
> happens at that time. Does clustering of documents also happen at
> the same time?

The Carrot2 plugin to ES does post-retrieval document clustering, so
you get clusters for each individual query (and its set of hits). For
this reason the query is also important -- it provides a hint to the
algorithm as to which trivial clusters it should avoid.
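To make the role of the query concrete, here is a rough sketch of
post-retrieval clustering over one query's hits. This is *not* one of
Carrot2's algorithms, just a toy that groups hits by their most widely
shared term. The names cluster_hits, tokens and STOPWORDS are made up
for the example. Note how the query terms are excluded as labels: every
hit contains them, so a cluster named after them would be trivial.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny stopword list, only for this illustration.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "and", "to", "with", "at", "on"}


def tokens(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cluster_hits(query, hits):
    """Group the hits of a single query by their most widely shared term."""
    # Query terms occur in (nearly) every hit, so they make trivial labels.
    trivial = set(tokens(query)) | STOPWORDS

    # Count in how many hits each candidate label term occurs.
    doc_freq = Counter()
    for hit in hits:
        doc_freq.update(set(tokens(hit)) - trivial)

    clusters = defaultdict(list)
    for hit in hits:
        shared = [t for t in set(tokens(hit)) - trivial if doc_freq[t] > 1]
        # Pick the most widely shared term; break ties alphabetically.
        label = min(shared, key=lambda t: (-doc_freq[t], t)) if shared else "other"
        clusters[label].append(hit)
    return dict(clusters)


hits = [
    "Data mining software tools",
    "Open source data mining software",
    "Data mining course at university",
    "University course on data mining",
]
for label, members in cluster_hits("data mining", hits).items():
    print(label, members)
```

The clusters only exist for this one result set. A different query
produces different hits and therefore different clusters.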

Off-line document clustering would have to run over all documents in
a collection (index), assign cluster labels to them and then just
filter on these labels at query time (much like faceting does). Carrot2
does *not* provide such functionality (and it very likely won't scale
to large indexes). You may want to check out Apache Mahout for this.
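For contrast, a minimal sketch of the off-line variant: labels are
assigned once over the whole collection and stored, and a query merely
filters on them. Everything here is hypothetical, in particular
toy_label, which stands in for a real batch clustering algorithm (the
kind of job you would hand to something like Mahout); it is not how
any particular library does it.

```python
def label_collection(docs, label_fn):
    """Off-line step: run once over every document and store one label each."""
    return {doc_id: label_fn(text) for doc_id, text in docs.items()}


def search(docs, labels, term, cluster=None):
    """Query-time step: match on the term, then filter by the stored label."""
    matched = [doc_id for doc_id, text in docs.items() if term.lower() in text.lower()]
    if cluster is not None:
        matched = [doc_id for doc_id in matched if labels[doc_id] == cluster]
    return sorted(matched)


def toy_label(text):
    # Hypothetical stand-in for a real clustering algorithm.
    lowered = text.lower()
    return "sports" if any(w in lowered for w in ("football", "tennis")) else "other"


docs = {
    1: "Football match report",
    2: "Tennis match report",
    3: "Quarterly sales report",
}
labels = label_collection(docs, toy_label)      # computed ahead of time
print(search(docs, labels, "report"))            # all matching documents
print(search(docs, labels, "report", "sports"))  # filtered by stored label
```

Here the labels do not depend on the query at all, which is why this
behaves much like a facet filter.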

Dawid
