RediSearch seems to be fully in-memory, with no analysis or query
chain, nor any real multilingual support. It is an apples-and-oranges
comparison, and their "big" feature is what Lucene started from (the
term list). I don't even see phrase search support, as they don't seem
to implement posting lists, just the terms.
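To make the point concrete: a bare term list can only tell you which
documents contain a word, while a posting list that stores per-document
positions is what makes phrase queries answerable. A toy sketch (my own
simplified structure and names, not Lucene's actual format):

```python
# Toy positional index: term -> {doc_id: [positions]}.
# A bare term list can answer "docs containing both words", but a
# phrase query ("new york" as adjacent words) needs the positions.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    terms = phrase.lower().split()
    if any(t not in index for t in terms):
        return set()
    # Docs containing every term: all a bare term list gives you.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    hits = set()
    for doc in candidates:
        # Check that the terms appear at consecutive positions.
        for p in index[terms[0]][doc]:
            if all(p + i in index[t][doc] for i, t in enumerate(terms)):
                hits.add(doc)
    return hits

idx = build_index({1: "new york city", 2: "york has a new museum"})
print(phrase_search(idx, "new york"))  # -> {1}: only doc 1 has them adjacent
```

With only the term-to-docs mapping (the `candidates` step), "new york"
and "york ... new" are indistinguishable; the position check is exactly
what a term list alone cannot do.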

Also, I don't see them publishing their Elasticsearch or Solr
configuration, which, from past experience, is often left untuned.

But yes, good for them. And good for Postgres for adding full-text
search some months ago. Even good for Oracle for having a commercial
(however hardcoded and terrible) full-text search.

I think the summary, in my mind, is that if software is swallowing
the world, then search is swallowing the software. Maybe it will
become the last "kitchen sink" proof, replacing email. And the more
interesting ideas going around, the better. Some of them, I am sure,
will end up in Lucene/Solr/Elasticsearch, as, after all, they are
the most popular platforms, and people will bring those extra things
to the core platform they use if they really want them.

Regards,
   Alex.


----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 10 February 2017 at 11:38, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> @Alex,
> I don't know if you've seen it, but there's also the RediSearch module,
> which they claim is faster (of course, with fewer features):
> https://redislabs.com/blog/adding-search-engine-redis-adventures-module-land/
> http://www.slideshare.net/RedisLabs/redis-for-search
> https://github.com/RedisLabsModules/RediSearch
>
> On Fri, Feb 10, 2017 at 1:36 PM, Dorian Hoxha <dorian.ho...@gmail.com>
> wrote:
>>
>>
>>
>> On Wed, Feb 8, 2017 at 3:58 PM, Alexandre Rafalovitch <arafa...@gmail.com>
>> wrote:
>>>
>>> Once you filter out the JIRA messages, the forum is very strong and
>>> alive. It is just very focused on its purpose - building Solr,
>>> Lucene, and Elasticsearch.
>>
>> Will do just that. Thanks.
>>>
>>>
>>> As to "perfection" - nothing is perfect; you can just look at the
>>> list of open JIRAs for Lucene and/or Solr to confirm that. But there
>>> is constant improvement and an ever-deepening of features and
>>> performance.
>>>
>>> You can also look at Elasticsearch for inspiration, as they build on
>>> Lucene (and are contributing to it) and had a chance to rebuild the
>>> layers above it.
>>
>> They have fancier features, but fewer advanced ones (e.g. no shard
>> splitting!)
>>>
>>>
>>> On your question specifically, I think it is hard to answer it well.
>>> Partially because I am not sure your assumptions are all that thought
>>> out. For example:
>>> 1) Different language than Java - Solr relies on Zookeeper, Tika and
>>> other libraries. All of those are in Java. Language change implies
>>> full change of the dependencies and ecosystem and - without looking -
>>> I doubt there is an open-source comprehensive MSWord parser in
>>> C++/Rust.
>>
>> Usually indexing speed is not the bottleneck (besides logging and some
>> other scenarios), so you could probably use a Java service (for Tika).
>> ZooKeeper is again not a bottleneck when serving requests, and you can
>> still use it with a non-Java db.
>>>
>>> 2) Algolia radix? Lucene uses pre-compiled DFA (deterministic finite
>>> automata). Are you sure the open graph chosen because Algolia wants to
>>> run on the phone is an improvement on the DFA?
>>
>> The `suggesters`, which are backed by a DFA, can't be used with normal
>> filters/queries, which is critical (and which Algolia's radix can do).
>>>
>>> 3) Document distribution is already customizable with _route_ key,
>>> though obviously Maguro algorithm is beyond single key's reach. On the
>>> other hand, I am not sure Maguro is designed for good faceting,
>>> streaming, enumerations, or other features Lucene/Solr has in its
>>> core.
>>
>> Yes, seems very special use case.
>>>
>>>
>>> As to the rest (GPU!, FPGA), we accept contributions. Including large,
>>> complex, interesting contributions (streams, learning to rank,
>>> docvalues, etc).
>>
>> I mean just in the "ideas case", not do it for me.
>>>
>>> And, long term, it is probably more effective to be
>>> able to innovate within a well-established framework rather than
>>> reinventing things from scratch. After all, even Twitter and LinkedIn
>>> built their internal implementations on top of Lucene rather than
>>> reinventing absolutely everything.
>>
>> Depends how core it is to your company and how good your team is at
>> low-level work. Most of the time, yes, but sometimes you have to (like
>> the ScyllaDB case: they've built A LOT from scratch, like a custom
>> scheduler, etc.)
>>>
>>>
>>> Still, Elasticsearch had a - very successful - go at the "Innovator's
>>> Dilemma" situation. If you want to create a team trying to
>>> rebuild/improve the approaches completely from scratch, I am sure you
>>> will find a lot of us looking at your efforts with interest. I, for
>>> one, would be happy to point out a new radically-different approach to
>>> search engine implementation on my Solr Start mailing list.
>>
>> That's why I'm asking for ideas. This is what I got from another dev on
>> the same question: https://news.ycombinator.com/item?id=13249724
>> Quote: "Multicore parallel shared-nothing architecture like the one in
>> the TurboPFor inverted index app and a RAM-resident inverted index."
>>
>>
>>
>>>
>>> Regards and good luck,
>>>    Alex.
>>> ----
>>> http://www.solr-start.com/ - Resources for Solr users, new and
>>> experienced
>>>
>>>
>>> On 8 February 2017 at 03:39, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
>>> > So, am I asking too much (maybe)? Is this forum dead (then where to
>>> > ask? there is extreme noise here)? Is Lucene perfect (of course not)?
>>> >
>>> >
>>> > On Wed, Jan 25, 2017 at 5:01 PM, Dorian Hoxha <dorian.ho...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Was thinking also about how Bing doesn't use posting lists and also
>>> >> compiles queries!
>>> >> About the queries, I would have thought the overhead wouldn't be as
>>> >> high as for queries in an RDBMS, since those apply to each row, while
>>> >> in search they apply to each bitset.
>>> >>
>>> >>
>>> >> On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jwar...@whitepages.com>
>>> >> wrote:
>>> >>>
>>> >>>
>>> >>>
>>> >>> I’ve had some curiosity about this question too.
>>> >>>
>>> >>>
>>> >>>
>>> >>> For a while, I watched for a seastar-like library for the JVM, but
>>> >>> https://github.com/bestwpw/windmill was the only one I came across,
>>> >>> and it
>>> >>> doesn’t seem to be going anywhere. Since one of the points of the JVM
>>> >>> is to
>>> >>> abstract away the platform, I certainly wonder if the JVM will ever
>>> >>> get the
>>> >>> kinds of machine affinity these other projects see. Your
>>> >>> one-shard-per-core
>>> >>> could probably be faked with multiple JVMs and numactl - could be an
>>> >>> interesting experiment.
>>> >>>
>>> >>>
>>> >>>
>>> >>> That said, I’m aware that a phenomenal amount of optimization effort
>>> >>> has
>>> >>> gone into Lucene, and I’d also be interested in hearing about things
>>> >>> that
>>> >>> worked well.
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> From: Dorian Hoxha <dorian.ho...@gmail.com>
>>> >>> Reply-To: "dev@lucene.apache.org" <dev@lucene.apache.org>
>>> >>> Date: Friday, January 20, 2017 at 8:12 AM
>>> >>> To: "dev@lucene.apache.org" <dev@lucene.apache.org>
>>> >>> Subject: How would you architect solr/lucene if you were starting
>>> >>> from
>>> >>> scratch for them to be 10X+ faster/efficient ?
>>> >>>
>>> >>>
>>> >>>
>>> >>> Hi friends,
>>> >>>
>>> >>> I was thinking about how ScyllaDB's architecture works compared to
>>> >>> Cassandra's, which gives them 10x+ performance and lower latency. If
>>> >>> you were starting Lucene and Solr from scratch, what would you do to
>>> >>> achieve something similar?
>>> >>>
>>> >>> Different language (rust/c++?) for better SIMD ?
>>> >>>
>>> >>> Use a GPU with an SSD for posting-list intersection? (not out yet)
>>> >>>
>>> >>> Make it in-memory and use better data structures?
>>> >>>
>>> >>> Shard on cores like ScyllaDB (so 1 shard for each core on the
>>> >>> machine)?
>>> >>>
>>> >>> External cache (like keeping n redis-servers with big ram/network &
>>> >>> slow
>>> >>> cpu/disk just for cache) ??
>>> >>>
>>> >>> Use better data structures (like algolia autocomplete radix )
>>> >>>
>>> >>> Distributing documents by term instead of id ?
>>> >>>
>>> >>> Using ASIC / FPGA ?
>>> >>>
>>> >>>
>>> >>>
>>> >>> Regards,
>>> >>>
>>> >>> Dorian
>>> >>
>>> >>
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>
>

