First off, both Elasticsearch and Solr are healthy, strong projects with very good products and support. They both have great APIs and tons of features that perform very well in most use cases.
Blur was born before Solr integrated with Hadoop (and before Solr merged with the Lucene project), and Solr actually took Blur's first version of the block cache. If you are not familiar with the block cache, it's basically an OS file system cache replacement that makes accessing HDFS perform well enough to serve as an index storage system without the need to copy all the indexes locally. Blur has been using a second version of the block cache since then, but I don't believe that Solr ever updated it. I haven't kept up with the Solr project, so they may have moved on from the original block cache as well.

The two primary goals of Blur that really defined the implementation were:

- Quick search response

There are a few features on the query/read side of Blur, but not as many as in ES or Solr. One of the reasons for this is that Blur only implemented features that would work at any index size the system could handle (or at least that was the goal). Among the latest additions to the read side of Blur was the Commands API, which allows developers to create their own server-side functions with direct access to the Lucene indexes. Commands were used to perform exports of the data in the index, to create facets that always give you a proper count (not just the top N as in Solr), or to do anything else you could come up with to execute against a Lucene index. Basically, they could be used to create new features without the need for a new Thrift call and the supporting API changes. There are many other features like document-level access control, query cancellation (another feature that Solr adopted), etc.

- Massive data ingestion

The focus on ingestion was not latency but rather the ability to incrementally add large amounts of data to an index that is likely also very large on its own. The project uses YARN MR for this; it is not a quick way to bring data in, but if your needs are to index large chunks of data incrementally, it works very well.
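To make the block cache idea concrete, here is a minimal sketch of the concept, not Blur's actual implementation: all class names, the block size, and the map-based storage are assumptions for illustration. The real cache manages fixed-size blocks in pre-allocated (often off-heap) memory with eviction; the point shown here is just that index reads hit cached blocks instead of going back to HDFS.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative sketch only -- not Blur's real block cache.
// Caches fixed-size file blocks so repeated index reads avoid HDFS.
public class BlockCacheSketch {
    static final int BLOCK_SIZE = 8 * 1024; // assumed block granularity

    // key: (fileId, blockIndex) packed into a long; value: the cached block bytes
    private final Map<Long, byte[]> cache = new ConcurrentHashMap<>();

    private static long key(int fileId, int blockIndex) {
        return ((long) fileId << 32) | (blockIndex & 0xFFFFFFFFL);
    }

    /** Returns the cached block, invoking the supplied HDFS read only on a miss. */
    public byte[] read(int fileId, int blockIndex, Supplier<byte[]> hdfsRead) {
        return cache.computeIfAbsent(key(fileId, blockIndex), k -> hdfsRead.get());
    }
}
```

A second read of the same block returns the cached bytes without touching the backing store, which is what lets HDFS-resident indexes perform acceptably.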
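The exact-count facet point above can also be sketched. A real Command runs on each shard server with direct Lucene index access; here each shard's term counts are simply faked as a map, and the class and method names are assumptions. Because every shard returns its complete counts rather than only its top N, the merged totals are exact.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not the actual Blur Commands API.
// Merges complete per-shard term counts into exact global facet counts.
public class ExactFacetSketch {
    public static Map<String, Long> merge(Iterable<Map<String, Long>> shardCounts) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> shard : shardCounts) {
            shard.forEach((term, count) -> totals.merge(term, count, Long::sum));
        }
        return totals; // exact, since no shard truncated its counts to a top N
    }
}
```

With top-N merging, a term that is just below the cutoff on several shards can be undercounted or missed entirely; merging full per-shard counts avoids that at the cost of shipping more data per shard.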
Also, if a full reindex was needed, this could be done easily as well. Something to point out here is that the MR indexing puts very little strain on the running system while performing the updates/reindexes; I believe this differs from how ES and Solr are implemented.

Let me know if this doesn't answer your questions or if you want to go into any more detail.

Thanks!

Aaron

On Tue, Jan 3, 2017 at 3:42 PM, Lukáš Vlček <[email protected]> wrote:

> Hi,
>
> What does it mean that Blur's approach "is arguably better" for large data
> compared to mentioned competitors? Does it mean faster indexing? Smaller
> index size? Better utilization of resources (RAM, CPU, IO) for large data
> querying? ... I would be interested in learning more about how it differs
> from Elasticsearch and Solr.
>
> Regards,
> Lukáš
>
> On Sun, Dec 25, 2016 at 6:30 PM, Aaron McCurry <[email protected]> wrote:
>
> > It is, but without a community of active developers it has become
> > stagnant. For example, the Lucene library version it utilizes has become
> > outdated, and it would likely be a major undertaking to update the code
> > base to the newest version. The biggest reason for the low activity is
> > that I haven't had time to work on the project due to personal reasons.
> >
> > In its current state it is very stable, even at very large index sizes;
> > however, the upfront development effort to use Blur is very high by
> > comparison to ElasticSearch or Solr. I believe this was the primary
> > reason Blur never really caught on in the community.
> >
> > Aaron
> >
> > On Sun, Dec 25, 2016 at 12:14 PM, Mark Kerzner <[email protected]>
> > wrote:
> >
> > > But,
> > >
> > > Isn't Blur a new approach arguably better than SOLR and ElasticSearch
> > > for big sizes?
> > >
> > > Mark
> > >
