On Wed, Feb 8, 2017 at 3:58 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> Once you filter out the JIRA messages, the forum is very strong and
> alive. It is just very focused on its purpose - building Solr,
> Lucene, and Elasticsearch.

Will do just that. Thanks.

> As to "perfection" - nothing is perfect; you can just look at the list
> of open JIRAs to confirm that for Lucene and/or Solr. But there is
> constant improvement and an ever-deepening of features and performance.
>
> You can also look at Elasticsearch for inspiration, as they build on
> Lucene (and are contributing to it) and had a chance to rebuild the
> layers above it.

They have more fancy features, but fewer advanced ones (ex: shard splitting!).

> On your question specifically, I think it is hard to answer it well,
> partially because I am not sure your assumptions are all that thought
> out. For example:
>
> 1) Different language than Java - Solr relies on ZooKeeper, Tika, and
> other libraries. All of those are in Java. A language change implies a
> full change of the dependencies and ecosystem, and - without looking -
> I doubt there is an open-source comprehensive MS Word parser in
> C++/Rust.

Usually indexing speed is not the bottleneck (besides logging and some
other scenarios), so you could probably keep a Java service for Tika.
ZooKeeper is likewise not a bottleneck when serving requests, and you can
still use it with a non-Java database.

> 2) Algolia radix? Lucene uses pre-compiled DFAs (deterministic finite
> automata). Are you sure the structure Algolia chose (because they want
> to run on the phone) is an improvement on the DFA?

The suggesters that are backed by a DFA can't be combined with normal
filters/queries, which is critical (and which the Algolia radix tree can do).

> 3) Document distribution is already customizable with the _route_ key,
> though obviously the Maguro algorithm is beyond a single key's reach.
> On the other hand, I am not sure Maguro is designed for good faceting,
> streaming, enumerations, or other features Lucene/Solr has in its
> core.
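A minimal sketch of the filter-aware lookup from point (2) above, in plain Java (hypothetical code, not Lucene's or Algolia's actual API): a tree-walk over sorted terms can test an arbitrary filter against every candidate at lookup time, which a pre-compiled suggestion automaton cannot easily do.

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical sketch only. A TreeMap stands in for a radix trie here:
// both give sorted, prefix-contiguous iteration, which is what lets a
// per-candidate filter be applied during the lookup itself.
class FilteredSuggester {
    private final TreeMap<String, Integer> entries = new TreeMap<>(); // term -> weight

    void add(String term, int weight) { entries.put(term, weight); }

    // Return up to `limit` terms starting with `prefix` that pass `filter`.
    List<String> suggest(String prefix, Predicate<String> filter, int limit) {
        List<String> out = new ArrayList<>();
        for (String term : entries.tailMap(prefix, true).keySet()) {
            if (!term.startsWith(prefix)) break;   // left the prefix range
            if (filter.test(term)) out.add(term);  // filter evaluated per hit
            if (out.size() == limit) break;
        }
        return out;
    }
}
```

The point is the `Predicate` argument: with a pre-compiled DFA/FST suggester the candidate set is baked in at build time, while a tree-walk can drop candidates against any query-time condition.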
Yes, that seems like a very special use case.

> As to the rest (GPU!, FPGA), we accept contributions. Including large,
> complex, interesting contributions (streams, learning to rank,
> docvalues, etc).

I meant it just as an "ideas" question, not "do it for me".

> And, long term, it is probably more effective to be able to innovate
> within the well-established framework rather than reinventing things
> from scratch. After all, even Twitter and LinkedIn built their
> internal implementations on top of Lucene rather than reinventing
> absolutely everything.

It depends on how core search is to your company and how good at
low-level work your team is. Most of the time yes, but sometimes you
have to (like the ScyllaDB case - they've built A LOT from scratch,
like a custom scheduler, etc.).

> Still, Elasticsearch had a - very successful - go at the "Innovator's
> Dilemma" situation. If you want to create a team trying to
> rebuild/improve the approaches completely from scratch, I am sure you
> will find a lot of us looking at your efforts with interest. I, for
> one, would be happy to point out a new radically-different approach to
> search engine implementation on my Solr Start mailing list.

That's why I'm asking for ideas. This is what I got from another dev on
the same question: https://news.ycombinator.com/item?id=13249724
Quote: "Multicores parallel shared nothing architecture like the one in
the TurboPFor inverted index app and a ram resident inverted index."

> Regards and good luck,
>    Alex.
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
> On 8 February 2017 at 03:39, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> > So, am I asking too much (maybe), is this forum dead (then where to
> > ask? there is extreme noise here), is Lucene perfect (of course not)?
> >
> > On Wed, Jan 25, 2017 at 5:01 PM, Dorian Hoxha <dorian.ho...@gmail.com>
> > wrote:
> >> Was thinking also about how Bing doesn't use posting lists, and
> >> also about compiling queries!
> >> About the queries: I would have thought the overhead wouldn't be as
> >> high as for queries in an RDBMS, since those apply to each row,
> >> while in search they apply to each bitset.
> >>
> >> On Mon, Jan 23, 2017 at 6:04 PM, Jeff Wartes <jwar...@whitepages.com>
> >> wrote:
> >>>
> >>> I've had some curiosity about this question too.
> >>>
> >>> For a while, I watched for a seastar-like library for the JVM, but
> >>> https://github.com/bestwpw/windmill was the only one I came across,
> >>> and it doesn't seem to be going anywhere. Since one of the points
> >>> of the JVM is to abstract away the platform, I certainly wonder if
> >>> the JVM will ever get the kinds of machine affinity these other
> >>> projects see. Your one-shard-per-core could probably be faked with
> >>> multiple JVMs and numactl - could be an interesting experiment.
> >>>
> >>> That said, I'm aware that a phenomenal amount of optimization
> >>> effort has gone into Lucene, and I'd also be interested in hearing
> >>> about things that worked well.
> >>>
> >>> From: Dorian Hoxha <dorian.ho...@gmail.com>
> >>> Reply-To: "dev@lucene.apache.org" <dev@lucene.apache.org>
> >>> Date: Friday, January 20, 2017 at 8:12 AM
> >>> To: "dev@lucene.apache.org" <dev@lucene.apache.org>
> >>> Subject: How would you architect solr/lucene if you were starting
> >>> from scratch for them to be 10X+ faster/efficient?
> >>>
> >>> Hi friends,
> >>>
> >>> I was thinking about how the ScyllaDB architecture works compared
> >>> to Cassandra's, which gives them 10x+ performance and lower
> >>> latency. If you were starting Lucene and Solr from scratch, what
> >>> would you do to achieve something similar?
> >>>
> >>> Different language (Rust/C++?) for better SIMD?
> >>>
> >>> Use a GPU with an SSD for posting-list intersection? (not out yet)
> >>>
> >>> Make it in-memory and use better data structures?
> >>> Shard on cores like ScyllaDB (so 1 shard for each core on the
> >>> machine)?
> >>>
> >>> External cache (like keeping n Redis servers with big RAM/network
> >>> and slow CPU/disk just for caching)?
> >>>
> >>> Use better data structures (like the Algolia autocomplete radix
> >>> tree)?
> >>>
> >>> Distribute documents by term instead of by id?
> >>>
> >>> Use an ASIC / FPGA?
> >>>
> >>> Regards,
> >>> Dorian

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
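The "RAM-resident inverted index" idea from the HN comment quoted earlier in the thread boils down to sorted in-memory postings lists plus a merge-based intersection. A minimal, hypothetical Java sketch of just that core (real engines layer compression such as TurboPFor-style PForDelta, skip lists, and per-core sharding on top of it):

```java
import java.util.*;

// Hypothetical sketch: an in-memory inverted index with sorted postings
// lists and a linear-merge AND intersection. Illustrative only; not
// Lucene's actual postings implementation.
class RamIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Docs must be added in increasing docId order so postings stay sorted.
    void add(int docId, String... terms) {
        for (String t : terms)
            postings.computeIfAbsent(t, k -> new ArrayList<>()).add(docId);
    }

    // AND query: intersect the sorted postings lists of all terms.
    List<Integer> and(String... terms) {
        List<Integer> result = postings.getOrDefault(terms[0], List.of());
        for (int i = 1; i < terms.length; i++)
            result = intersect(result, postings.getOrDefault(terms[i], List.of()));
        return result;
    }

    // Classic two-pointer merge of two sorted lists.
    private static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return out;
    }
}
```

The "shard per core" idea in the list above would then mean running one such index per core, each owning a disjoint docId range, with no locks shared between them.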