On Thu, Oct 13, 2016 at 11:36 AM, Jaime Crespo <jcre...@wikimedia.org> wrote:
> 3) Move the service to labs, not providing any firm guarantee of service
> level ?
> Labs is not the place where bad services go to die. Production is the place
> where only very stable services reach so they can be properly managed.
> "WDQS do not go through any critical systems"
> "all direct clients of WDQS are well protected by circuit breakers"
> Why use the production network, then?
> I think there was one exception, which is services that needed a lot of
> resources so they could not run on vms, but don't we have a prototype of
> "labs on real hardware"?

I'm not sure why WDQS is in the production network (this predates me
joining WMF). It is probably there as you suggest for the real
hardware needs. There was also probably a wish to make WDQS a
production-level service with all the availability guarantees that go
with it, even if that goal is probably not achievable with the current
way WDQS works.

> Letting user run arbitrary queries is a problem for security, but not in the
> common sense (sql injection), but in terms of exactly the situation that you
> are describing- running easily out of resources (DOS). Even quarry, which I
> have publicly complained about in the past, for what you say, has a better
> resource management than wqs (30-minute limit execution, concurrency
> control, etc.).

(I did not know about quarry, I need to have a look!)

The main power of wdqs is that it allows users to write arbitrary
queries synchronously. With that power comes the ability to break the
service. Removing this ability greatly reduces the value of wdqs. We
can (and should) work on putting constraints in place to protect the
service, but there are limits to what is possible. I'm pretty sure
that whatever we put in place, it will still be possible to break the
service (unless we invest a crazy amount of time, energy, ...). No, I
don't know for sure...
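
To make "constraints to protect the service" a bit more concrete, here is
a minimal Python sketch of two such constraints: a cap on concurrent
queries and a limit on how long a caller waits. QueryGuard, QueryRejected
and their parameters are hypothetical names used for illustration only;
nothing here is taken from the actual WDQS code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class QueryRejected(Exception):
    """Raised when the guard refuses a query or gives up waiting for it."""


class QueryGuard:
    """Hypothetical guard: caps concurrent queries and the wait per query."""

    def __init__(self, max_concurrent, timeout_s):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        self._timeout_s = timeout_s

    def run(self, query_fn, *args):
        # Refuse immediately instead of queueing when all slots are busy.
        if not self._slots.acquire(blocking=False):
            raise QueryRejected("too many concurrent queries")

        def work():
            try:
                return query_fn(*args)
            finally:
                # The slot is freed only when the work really ends,
                # even if the caller stopped waiting long ago.
                self._slots.release()

        future = self._pool.submit(work)
        try:
            return future.result(timeout=self._timeout_s)
        except FutureTimeout:
            raise QueryRejected("query exceeded time limit") from None
```

The sketch also shows the limit of this approach: the time limit only
stops the caller from waiting. The runaway query keeps its slot until it
finishes, so real protection would also need the query engine itself to
cancel the work.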

> I do not think maps is a problem, as after all it is static tile requests
> mostly (the worst that could happen is having a lot of requests)- the only
> complaint there is that it is constantly creating noise on icinga. But
> running an unstable service (wdqs) on top of another unstable service
> (wikidata data handling) will never be stable. Every time a bot starts
> writing to wikidata 600 times per second, s5 dbs shake (that is why we are
> creating s8) and wqs goes down. :-)

I don't think this assumption is true. I have some experience running
available services on top of unavailable services. At JOB^1, we did
use quite a few external services which were not all that great in terms
of robustness. Payment processors and credit check services are good
examples of notoriously flaky external services once you put some load
on them. There are strategies to make that work, and in the end taking
into account that your dependencies can fail is a great way to build
much more robust services. I would even go as far as saying that
making sure that your dependencies fail often is a good way to ensure
that your system is robust. No, wdqs is not robust enough, but it is
something that can (and should) be fixed without changing the way we

> I would suggest using wqs on labs (or anywhere, non-production) with regular
> imports rather than real-time updates. Less headaches. I am literally aiming
> for that for labsdbs, too.

In the specific case of wdqs integration with wikidata, I don't think
that the integration pattern itself is wrong (on the fly import of
wikidata to wdqs). It does need some work to improve robustness
(https://phabricator.wikimedia.org/T139445 comes to mind). And it does
fulfil one of the important use cases of wdqs: quite a few wikidata
editors use wdqs to live check edits / imports to wikidata.

I agree with you that in its current state, WDQS is probably closer to
a labs service than to a production service (as far as I understand
the definition of labs and production here). The question I'm trying
to ask is how do we start using wdqs in a production context. I fully
understand that there is work to do here. This is not something that
will happen in a few days. But there is value in this idea, so we
should start looking at what path we want to take (or make sure that
there is no path worth taking, this is a perfectly acceptable answer,
as long as we look hard enough first).

My assumption is that it makes more sense to learn how to integrate
low reliability services in a production context than it does to make
sure wdqs becomes highly reliable. We should still work on improving
wdqs reliability, but we should accept that by its nature it will be
less reliable than most of the production services that we have.
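
One common way to integrate a low reliability service is the circuit
breaker mentioned in the original message. The following is a minimal
Python sketch under stated assumptions: CircuitBreaker, its parameters
and the fallback callable are hypothetical names, the class is not
thread-safe, and it is not a description of how any WDQS client works
today.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: stops calling a dependency that keeps failing."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: skip the flaky dependency and degrade right away.
                return fallback()
            # Cooldown elapsed (half-open): let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success: close the circuit and forget past failures.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller such as a map renderer would wrap its query in
breaker.call(fetch_layer, lambda: None) and simply skip the data layer
when the fallback value comes back, which is the degraded experience
described in idea 1 of the original message.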

By the way, thanks Jaime for the great write-up! It helps me quite a
bit to structure the random thoughts I have between my two ears!

> On Tue, Oct 11, 2016 at 10:37 PM, Guillaume Lederrey
> <gleder...@wikimedia.org> wrote:
>> Hello!
>> There is some discussion of starting to use WDQS in conjunction with
>> maps and graphs. Here are a few thoughts, just to put them out there
>> and to start getting some feedback. This is an attempt to put some
>> order in my thoughts, they are not complete yet...
>> WDQS exposes a SPARQL endpoint to users. This can be compared to
>> giving the ability to our users to write arbitrary SQL queries. This
>> is fairly close to the concept of the labs replica databases. Giving
>> direct access to a SPARQL endpoint is at the same time a wonderful
>> idea (it allows users to use WDQS in ways we would never have imagined)
>> and a very scary idea (users can write complex queries which will
>> consume all resources on our servers - which does happen from time to
>> time).
>> At the moment, WDQS is used by researchers, bots and power users. Those
>> users understand this constraint well, and the fluctuation of
>> performance of WDQS is not a major issue.
>> Making WDQS robust enough while letting users run arbitrary queries is
>> most probably extremely hard. I think that we should instead
>> investigate how to use an unstable service from a stable one.
>> Ideas...
>> 1) We can accept service degradation of specific functionalities. We
>> accept that WDQS is down, or slow, sometimes. In this case, we degrade
>> user experience, graphs will not work, maps will not display data
>> layers. In terms of implementation, we need to ensure that data flows
>> involving WDQS do not go through any critical systems, and that all
>> direct clients of WDQS are well protected by circuit breakers.
>> 2) We want to conserve user experience. We go fully async. Graphs and
>> maps are pre-generated and updated regularly outside of user
>> interaction. We probably still need synchronous access for editors, to
>> allow them to test their edits. Refresh can be relatively low
>> frequency (1/day or maybe less). We can probably optimize this based
>> on how often a specific graph / map is viewed. I'm not sure how easy
>> it would be to scale such an approach...
>> 3) Something else?
>> Time to get some sleep...
>>   MrG
>> --
>> Guillaume Lederrey
>> Operations Engineer, Discovery
>> Wikimedia Foundation
>> UTC+2 / CEST
>> _______________________________________________
>> Ops mailing list
>> o...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/ops
> --
> Jaime Crespo
> <http://wikimedia.org>

Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation

discovery mailing list