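Re: How would you architect solr/lucene if you were starting from scratch for them to be 10X+ faster/efficient ?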

Dorian Hoxha Fri, 10 Feb 2017 08:39:00 -0800

@Alex,
I don't know if you've seen it, but there's also the RediSearch module, which
they claim is faster (with fewer features, of course):
https://redislabs.com/blog/adding-search-engine-redis-adventures-module-land/
http://www.slideshare.net/RedisLabs/redis-for-search
https://github.com/RedisLabsModules/RediSearch


On Fri, Feb 10, 2017 at 1:36 PM, Dorian Hoxha <dorian.ho...@gmail.com>
wrote:

>
>
> On Wed, Feb 8, 2017 at 3:58 PM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
>> Once you filter out the JIRA messages, the forum is very strong and
>> alive. It is just very focused on its purpose - building Solr and
>> Lucene and ElasticSearch.
>>
> Will do just that. Thanks.
>
>>
>> As to "perfection" - nothing is perfect, you can just look at the list
>> of the open JIRAs to confirm that for Lucene and/or Solr. But there is
>> constant improvement and ever-deepening of the features and
>> performance improvement.
>>
>> You can also look at Elasticsearch for inspiration, as they build on
>> Lucene (and are contributing to it) and had a chance to rebuild the
>> layers above it.
>>
> They have more fancy features, but fewer advanced ones (e.g. shard
> splitting!)
>
>>
>> On your question specifically, I think it is hard to answer it well.
>> Partially because I am not sure your assumptions are all that thought
>> out. For example:
>> 1) Different language than Java - Solr relies on Zookeeper, Tika and
>> other libraries. All of those are in Java. Language change implies
>> full change of the dependencies and ecosystem and - without looking -
>> I doubt there is an open-source comprehensive MSWord parser in
>> C++/Rust.
>>
> Usually indexing speed is not the bottleneck (besides logging and some
> other scenarios), so you could probably use a Java service (for Tika).
> Zookeeper is again not a bottleneck when serving requests, and you can
> still use it with a non-Java db.
>
>> 2) Algolia radix? Lucene uses pre-compiled DFA (deterministic finite
>> automata). Are you sure the open graph chosen because Algolia wants to
>> run on the phone is an improvement on the DFA?
>>
> The `suggesters` which are backed by a DFA can't be combined with normal
> filters/queries, which is critical (and which the Algolia radix can do).
>
>> 3) Document distribution is already customizable with _route_ key,
>> though obviously Maguro algorithm is beyond single key's reach. On the
>> other hand, I am not sure Maguro is designed for good faceting,
>> streaming, enumerations, or other features Lucene/Solr has in its
>> core.
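To make the routing point above concrete, here is a minimal sketch of key-based document placement, loosely modeled on the `prefix!id` convention. Everything in it is an assumption for illustration: `shard_for` and `route_key_of` are hypothetical helpers, and CRC32 with a plain modulo merely stands in for Solr's own hash function and hash ranges.

```python
import zlib


def route_key_of(doc_id: str) -> str:
    """Use the part before '!' as the route key, else the whole id (assumed convention)."""
    return doc_id.split("!", 1)[0]


def shard_for(route_key: str, num_shards: int) -> int:
    """Map a route key to a shard index with a stable hash (CRC32 is a stand-in)."""
    return zlib.crc32(route_key.encode("utf-8")) % num_shards


# Documents sharing a route key are co-located on the same shard.
docs = ["tenantA!doc1", "tenantA!doc2", "tenantB!doc7"]
placement = {d: shard_for(route_key_of(d), 4) for d in docs}
```

A query that carries the same route key only has to visit one shard, which is the benefit a single-key scheme can offer; anything that needs to consider several keys at once is beyond it.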
>>
> Yes, seems very special use case.
>
>>
>> As to the rest (GPU!, FPGA), we accept contributions. Including large,
>> complex, interesting contributions (streams, learning to rank,
>> docvalues, etc).
>
> I meant that just as ideas, not as "do it for me".
>
>> And, long term, it is probably more effective to be
>> able to innovate within the well-established framework rather than
>> reinventing things from scratch. After all, even Twitter and LinkedIn
>> built their internal implementations on top of Lucene rather than
>> reinventing absolutely everything.
>>
> Depends on how core it is to your company and how good your team is at
> low-level work. Most of the time yes, but sometimes you have to (like the
> scylladb case: they've built A LOT from scratch, like a custom scheduler etc.)
>
>>
>> Still, Elasticsearch had a - very successful - go at the "Innovator's
>> Dilemma" situation. If you want to create a team trying to
>> rebuild/improve the approaches completely from scratch, I am sure you
>> will find a lot of us looking at your efforts with interest. I, for
>> one, would be happy to point out a new radically-different approach to
>> search engine implementation on my Solr Start mailing list.
>>
> That's why I'm asking for ideas. This is what I got from another dev on
> the same question:  https://news.ycombinator.com/item?id=13249724
> Quote: "Multicores parallel shared nothing architecture like the one in the
> TurboPFor inverted index app and a ram resident inverted index."
>
>
>
>
>> Regards and good luck,
>>    Alex.
>> ----
>> http://www.solr-start.com/ - Resources for Solr users, new and
>> experienced
>>
>>
>> On 8 February 2017 at 03:39, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
>> > So, am I asking too much (maybe), is this forum dead (then where to ask?
>> > there is extreme noise here), is lucene perfect (of course not)?
>> >
>> >
>> > On Wed, Jan 25, 2017 at 5:01 PM, Dorian Hoxha <dorian.ho...@gmail.com>
>> > wrote:
>> >>
>> >> I was also thinking about how Bing doesn't use posting lists, and also
>> >> about compiling queries!
>> >> About the queries, I would have thought the overhead wouldn't be as high
>> >> as for queries in an RDBMS, since those apply to each row while in search
>> >> they apply to each bitset.
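The "apply to each bitset" remark can be illustrated with a small sketch. It assumes each term's matches are held as one bit per document id, using a Python integer as the bitset; the terms, document ids and the helpers `to_bitset` and `from_bitset` are made up, and this is not a description of how Lucene stores or intersects postings.

```python
def to_bitset(doc_ids):
    """Pack a list of document ids into an integer used as a bitset."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def from_bitset(bits):
    """Unpack a bitset back into a sorted list of document ids."""
    out = []
    doc_id = 0
    while bits:
        if bits & 1:
            out.append(doc_id)
        bits >>= 1
        doc_id += 1
    return out


# Hypothetical postings: term -> ids of documents containing it.
postings = {"solr": [0, 2, 5, 7], "fast": [2, 3, 7], "java": [1, 2, 7, 8]}
bitsets = {term: to_bitset(ids) for term, ids in postings.items()}

# A conjunctive query is one bitwise AND per term, not one check per document.
hits = from_bitset(bitsets["solr"] & bitsets["fast"] & bitsets["java"])
print(hits)  # [2, 7]
```

The query operators run once per term-level structure rather than once per row, which is the intuition behind expecting less per-query overhead than in a row-oriented engine.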
>> >>
>> >>
>> >> On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jwar...@whitepages.com>
>> >> wrote:
>> >>>
>> >>>
>> >>>
>> >>> I’ve had some curiosity about this question too.
>> >>>
>> >>>
>> >>>
>> >>> For a while, I watched for a seastar-like library for the JVM, but
>> >>> https://github.com/bestwpw/windmill was the only one I came across, and it
>> >>> doesn’t seem to be going anywhere. Since one of the points of the JVM is to
>> >>> abstract away the platform, I certainly wonder if the JVM will ever get the
>> >>> kinds of machine affinity these other projects see. Your one-shard-per-core
>> >>> could probably be faked with multiple JVMs and numactl - could be an
>> >>> interesting experiment.
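As a rough sketch of the multiple-JVMs-plus-numactl experiment mentioned above, the snippet below only builds the command lines and does not run them. `--physcpubind` and `--localalloc` are real numactl options; the jar name `search-node.jar`, its port argument and the base port are hypothetical placeholders, not anything from Solr.

```python
def shard_commands(num_cores, base_port=9000, jar="search-node.jar"):
    """Build one command per core: a JVM pinned to that core with local memory allocation."""
    commands = []
    for core in range(num_cores):
        commands.append([
            "numactl",
            f"--physcpubind={core}",  # restrict the process to this CPU
            "--localalloc",           # allocate memory on the node it runs on
            "java", "-jar", jar,      # hypothetical single-shard server jar
            str(base_port + core),    # hypothetical port argument
        ])
    return commands


commands = shard_commands(4)
for cmd in commands:
    print(" ".join(cmd))
```

Each process would own one shard, so the number of shards equals the number of cores. Whether this gets close to a true shard-per-core design is exactly what the experiment would have to show.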
>> >>>
>> >>>
>> >>>
>> >>> That said, I’m aware that a phenomenal amount of optimization effort has
>> >>> gone into Lucene, and I’d also be interested in hearing about things that
>> >>> worked well.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> From: Dorian Hoxha <dorian.ho...@gmail.com>
>> >>> Reply-To: "dev@lucene.apache.org" <dev@lucene.apache.org>
>> >>> Date: Friday, January 20, 2017 at 8:12 AM
>> >>> To: "dev@lucene.apache.org" <dev@lucene.apache.org>
>> >>> Subject: How would you architect solr/lucene if you were starting from
>> >>> scratch for them to be 10X+ faster/efficient ?
>> >>>
>> >>>
>> >>>
>> >>> Hi friends,
>> >>>
>> >>> I was thinking about how the scylladb architecture works compared to
>> >>> cassandra, which gives them 10x+ performance and lower latency. If you
>> >>> were starting lucene and solr from scratch, what would you do to achieve
>> >>> something similar?
>> >>>
>> >>> Different language (rust/c++?) for better SIMD ?
>> >>>
>> >>> Use a GPU with a SSD for posting-list intersection ?(not out yet)
>> >>>
>> >>> Make it in-memory and use better data structures?
>> >>>
>> >>> Shard on cores like scylladb (so 1 shard for each core on the machine) ?
>> >>>
>> >>> External cache (like keeping n redis-servers with big ram/network & slow
>> >>> cpu/disk just for cache) ??
>> >>>
>> >>> Use better data structures (like algolia autocomplete radix )
>> >>>
>> >>> Distributing documents by term instead of id ?
>> >>>
>> >>> Using ASIC / FPGA ?
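One idea in the list above, distributing documents by term instead of by id, can be compared with the usual document partitioning in a toy sketch. The three documents, the shard count and all function names are invented for illustration, and CRC32 is just a convenient stable hash.

```python
import zlib
from collections import defaultdict

docs = {1: "fast search engine", 2: "search in memory", 3: "fast memory"}


def partition_by_doc(docs, num_shards):
    """Each shard indexes a subset of documents (all terms of those documents)."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id, text in sorted(docs.items()):
        index = shards[doc_id % num_shards]
        for term in set(text.split()):
            index[term].append(doc_id)
    return shards


def partition_by_term(docs, num_shards):
    """Each shard owns a subset of terms (the full posting list of each term)."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id, text in sorted(docs.items()):
        for term in set(text.split()):
            owner = zlib.crc32(term.encode("utf-8")) % num_shards
            shards[owner][term].append(doc_id)
    return shards


def shards_touched(shards, term):
    """Which shards hold at least part of this term's posting list."""
    return [i for i, index in enumerate(shards) if term in index]


by_doc = partition_by_doc(docs, 2)
by_term = partition_by_term(docs, 2)
print(shards_touched(by_doc, "search"))   # [0, 1]
print(len(shards_touched(by_term, "search")))  # 1
```

With term partitioning a single-term lookup touches one shard, but a multi-term query has to ship posting lists between shards, and every indexed document writes to several shards. That trade-off is why document partitioning is the common default.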
>> >>>
>> >>>
>> >>>
>> >>> Regards,
>> >>>
>> >>> Dorian
>> >>
>> >>
>> >
>>
>>
>
