Re: [Wikidata] WDQS with use of automated requests

2018-05-15 Thread Thad Guidry
Stas,

That is really good info and ideally should also go under
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_limits

and I would say that, even better, a new "Best Practices" page should be
created and added under the "First Steps" section here
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Wikidata_Query_Help
where that "Best Practices" page would also have a link to, and a blurb
about, the "query limits" page.

-Thad
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS with use of automated requests

2018-05-15 Thread Stas Malyshev
Hi!

On 5/15/18 3:27 PM, Justin Maltais wrote:
> Hi,
> 
> I am looking for the most efficient way of getting the following
> information out of WDQS:
> 
>  * One language only (e.g. fr.wikipedia.org)
>  * All instances of human (e.g. of the abstraction: wd:Q9916|Dwight
>    David

> Let's say we have a list of all sovereign states (Q16, Q30, Q142, ...)
> and all letters of the requested language (French: a, b, c, ...) , we
> can automate requests and get a lot of results. Unfortunately, it's
> costly and not efficient. It takes about a day to succeed.

The first thing I would like to ask is please don't do that again. This
created a significant load on the server, the script completely ignored
the throttling headers we sent, and in the future we would ban such
clients for extended periods of time, to prevent harm to the service. If
your client cannot abide by 429/Retry-After headers, please do not run
it in an automated, repeated fashion until it either handles them
properly or inserts delays long enough that you can be sure you are not
launching an avalanche of heavy requests and crowding out other users.
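
As an illustration of handling 429/Retry-After, here is a minimal
Python sketch that turns a response status and its headers into a
number of seconds to wait. The function name retry_delay, the
60-second fallback and the plain-dict headers are assumptions made for
the example, not anything WDQS prescribes. Retry-After may carry
either a number of seconds or an HTTP date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(status, headers, default=60.0, now=None):
    """Return how many seconds to wait before sending the next request.

    `headers` is assumed to be a mapping keyed by "Retry-After";
    `default` is an arbitrary fallback chosen for this sketch.
    """
    if status != 429:
        return 0.0  # not throttled, no extra wait needed
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: delay expressed as a number of seconds.
        return float(value)
    try:
        # Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header, fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A client loop would sleep for the returned number of seconds (for
example with time.sleep) before retrying the same query, rather than
moving straight on to the next request.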

If something takes too long, that's a good moment to ask for help, not
to put it in a loop that would hit the server repeatedly for days.

If you need to deal with a massive data set that needs to be processed,
I would suggest trying the following strategy:

1. Load the primary key data - like the list of all humans, if that's
what you need - into your own storage. For Q5 you can either use the LDF
server or parse the dump directly (maybe with Wikidata Toolkit?). For
some scenarios, even a direct query would be fine, but for Q5 it probably
would be too much.

2. Split this data set into palatable batches - like 100 items per batch
or so, you can experiment on that, it's fine to cause a couple of
timeouts if it's not an automated script doing it 20 times a second for
a long time. Once you have a sane batch size, run the query that needs to
fetch the other data, using a VALUES clause to substitute the primary key
data. Watch for 429 responses - if you're getting them, insert delays or
lower the batch size, or ask for help again if it doesn't work.
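
The batching in step 2 could be sketched in Python roughly as follows.
This is only an illustration: the helper names batches and
values_query are made up for the example, the query fetches just the
French label instead of the full set of fields, and the item IDs are
assumed to be already stored locally as strings like "Q9916".

```python
def batches(items, size=100):
    """Yield consecutive slices of `items` with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def values_query(qids):
    """Build a SPARQL query restricted to the given item IDs via VALUES.

    The selected fields are a reduced, illustrative subset.
    """
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?person ?personLabel WHERE {\n"
        f"  VALUES ?person {{ {values} }}\n"
        "  ?person rdfs:label ?personLabel.\n"
        '  FILTER(LANG(?personLabel) = "fr").\n'
        "}"
    )


if __name__ == "__main__":
    stored_ids = ["Q9916", "Q42", "Q76"]  # sample IDs from local storage
    for batch in batches(stored_ids, size=2):
        print(values_query(batch))
        print()
```

Each generated query would then be sent one at a time, with delays
inserted or the batch size lowered whenever a 429 comes back.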

Alternatively, segmenting the records by some other criteria may work
too, but I don't think a filter like STRSTARTS(?personLabel, "D") is
going to be effective - I don't think the Blazegraph query optimizer is
smart enough to convert this to an index lookup, and without that, this
is just slowing things down by introducing more checks in the query. And
even if it did, there are a lot of labels starting with "D", so that
probably won't be too useful for speeding it up.

Having said that, I am curious - what exactly are you doing with this
data set? Why do you need a list of all humans - how is this list going
to be used? Knowing that may help to devise a better specialized strategy
for achieving the same.

-- 
Stas Malyshev
smalys...@wikimedia.org



[Wikidata] WDQS with use of automated requests

2018-05-15 Thread Justin Maltais

Hi,

I am looking for the most efficient way of getting the following 
information out of WDQS:


 * One language only (e.g. fr.wikipedia.org)
 * All instances of human (e.g. of the abstraction:
   wd:Q9916|Dwight David Eisenhower|États-Unis|Dwight|Eisenhower||militaire américain, président des États-Unis)


Let's say we have a list of all sovereign states (Q16, Q30, Q142, ...) 
and all letters of the requested language (French: a, b, c, ...), we 
can automate requests and get a lot of results. Unfortunately, it's 
costly and not efficient. It takes about a day to succeed.


SELECT ?person ?personLabel ?countryLabel ?givenNameLabel ?familyNameLabel ?article ?persondesc
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P27 wd:Q30;
          wdt:P27 ?country;
          wdt:P734 ?familyName;
          wdt:P735 ?givenName;
          rdfs:label ?personLabel.
  ?familyName rdfs:label ?familyNameLabel.
  ?country rdfs:label ?countryLabel.
  ?givenName rdfs:label ?givenNameLabel.
  ?person schema:description ?persondesc.
  FILTER(LANG(?personLabel) = "fr").
  FILTER(LANG(?familyNameLabel) = "en").
  FILTER(LANG(?countryLabel) = "fr").
  FILTER(LANG(?givenNameLabel) = "en").
  FILTER(LANG(?persondesc) = "fr").
  FILTER(STRSTARTS(?personLabel, "D")).
  FILTER(STRSTARTS(?familyNameLabel, "E")).

  ?article schema:about ?person;
           schema:inLanguage "fr";
           schema:isPartOf <https://fr.wikipedia.org/> .
}


https://query.wikidata.org/#SELECT%20%3Fperson%20%3FpersonLabel%20%3FcountryLabel%20%3FgivenNameLabel%20%3FfamilyNameLabel%20%3Farticle%20%3Fpersondesc%0AWHERE%0A%7B%0A%20%20%3Fperson%20wdt%3AP31%20wd%3AQ5%3B%0A%20%20%20%20%20%20%20%20%20%20%23wdt%3AP21%20wd%3AQ6581097%3B%0A%20%20%20%20%20%20%20%20%20%20wdt%3AP27%20wd%3AQ30%3B%0A%20%20%20%20%20%20%20%20%20%20wdt%3AP27%20%3Fcountry%3B%0A%20%20%20%20%20%20%20%20%20%20wdt%3AP734%20%3FfamilyName%3B%0A%20%20%20%20%20%20%20%20%20%20wdt%3AP735%20%3FgivenName%20%3B%0A%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FpersonLabel.%0A%20%20%3FfamilyName%20rdfs%3Alabel%20%3FfamilyNameLabel.%0A%20%20%3Fcountry%20rdfs%3Alabel%20%3FcountryLabel.%0A%20%20%3FgivenName%20rdfs%3Alabel%20%3FgivenNameLabel.%0A%20%20%3Fperson%20schema%3Adescription%20%3Fpersondesc.%0A%20%20FILTER%28LANG%28%3FpersonLabel%29%20%3D%20%22fr%22%29.%0A%20%20FILTER%28LANG%28%3FfamilyNameLabel%29%20%3D%20%22en%22%29.%0A%20%20FILTER%28LANG%28%3FcountryLabel%29%20%3D%20%22fr%22%29.%0A%20%20FILTER%28LANG%28%3FgivenNameLabel%29%20%3D%20%22en%22%29.%0A%20%20FILTER%28LANG%28%3Fpersondesc%29%20%3D%20%22fr%22%29.%0A%20%20FILTER%28STRSTARTS%28%3FpersonLabel%2C%20%22D%22%29%29.%0A%20%20FILTER%28STRSTARTS%28%3FfamilyNameLabel%2C%20%22E%22%29%29.%0A%20%20%0A%20%20%3Farticle%20schema%3Aabout%20%3Fperson%3B%0A%20%20%20%20%20%20%20%20%20%20%20schema%3AinLanguage%20%22fr%22%3B%0A%20%20%20%20%20%20%20%20%20%20%20schema%3AisPartOf%20%3Chttps%3A%2F%2Ffr.wikipedia.org%2F%3E%20.%0A%20%20%23SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22fr%22.%20%7D%0A%20%20%0A%7D%0A%0AORDER%20BY%20%3FfamilyNameLabel

Such a request takes an average of 20 seconds to complete.

Any