Sorry, I’m not sure what you mean about the query service being slow. I was
able to fetch all 3.5M results with the following command:

$ time curl --silent --header 'Accept: application/json' --get \
    --data-urlencode 'query=SELECT ?entity ?property_value WHERE { ?entity wdt:P1566 ?property_value. }' \
    https://query.wikidata.org/sparql \
    | xz > /tmp/geonames.xz

real    2m28,161s
user    2m27,023s
sys     0m4,334s

$ time unxz < /tmp/geonames.xz | jq '.results.bindings | length'
3495429

real    0m33,030s
user    0m33,467s
sys     0m4,702s

Sure, it took a few minutes to download, but that’s not unexpected for a
large result set :) As long as you don’t use LIMIT and OFFSET, the query is
quite efficient, and it’s only bound by how fast the query service can pump
out the data. (If it were really computationally expensive, it would be
killed after sixty seconds anyway.)
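By the way, if you want the pairs as TSV rather than just a count, jq can flatten the SPARQL JSON results format directly. A sketch, with an inline one-row sample standing in for the real download (swap the printf for `unxz < /tmp/geonames.xz` to run it over the full result set):

```shell
# One sample binding in the shape the query service returns (stand-in data);
# the real input would come from: unxz < /tmp/geonames.xz
printf '%s' '{"results":{"bindings":[{"entity":{"type":"uri","value":"http://www.wikidata.org/entity/Q2113430"},"property_value":{"type":"literal","value":"18918"}}]}}' |
  jq -r '.results.bindings[] | [.entity.value, .property_value.value] | @tsv'
```

This prints one tab-separated line per binding, e.g. the entity URI followed by the GeoNames ID.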

2018-03-14 17:34 GMT+01:00 Huji Lee <huji.h...@gmail.com>:

> Actually, never mind. I reviewed the Java code behind it and it doesn't
> support more items per page. It also gets slow when you look at later pages
> (first few pages are in a warm cache and are fast).
>
> I think my best bet is to just download the latest JSON dump from
> https://www.wikidata.org/wiki/Wikidata:Database_download and parse it
> myself.
>
> Thanks again!
>
> Huji
>
> On Wed, Mar 14, 2018 at 12:12 PM, Huji Lee <huji.h...@gmail.com> wrote:
>
>> Lucas,
>>
>> No I don't need the page_id. The other two are enough.
>>
>> Wikidata Query Service seems very slow (it'll take about one day of
>> continuous querying to get all the data). Linked Data Fragments server
>> seems faster, but I wish I knew how to make it return more than 100 results
>> at a time. Do you?
>>
>> Thanks,
>>
>> Huji
>>
>> On Wed, Mar 14, 2018 at 7:00 AM, Lucas Werkmeister <
>> lucas.werkmeis...@wikimedia.de> wrote:
>>
>>> Huji, do you need the page_id in the query results? Otherwise, I would
>>> suggest using either the Wikidata Query Service, as Jaime suggested (though
>>> I’d omit the LIMIT and OFFSET – I think it’s better to let the server send
>>> you all the results at once) or the Linked Data Fragments server:
>>> https://query.wikidata.org/bigdata/ldf?subject=&predicate=http%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2FP1566&object=
>>> (this URL will return HTML, RDF/XML, Turtle, JSON-LD, … depending on the
>>> Accept header).
>>>
>>> Cheers,
>>> Lucas
>>>
>>> 2018-03-14 1:03 GMT+01:00 Huji Lee <huji.h...@gmail.com>:
>>>
>>>> Thanks, Jaime, for your recommendation.
>>>>
>>>> If I understand the result of [1] correctly, there are around 3.5
>>>> million pages with a GeoNames property specified on Wikidata. I'm sure some
>>>> of them are redirects, or not cities, etc. But still, going through
>>>> millions of pages via API calls of 1000 at a time is cumbersome and
>>>> inefficient. (The example you gave takes 20 seconds per 1000 results;
>>>> for 3.5 million results that is about 3,500 calls, i.e. 20 * 3,500 =
>>>> 70,000 seconds, roughly 19 hours, assuming no lag or errors.)
>>>>
>>>> However, what you suggested gave me an idea: I can take a look at the
>>>> code for the Api itself (I guess it is at [2]) and figure out how the query
>>>> is written there, then try to write a similar query on my own. If I figure
>>>> it out, I will report back here.
>>>>
>>>> Huji
>>>>
>>>> [1] https://quarry.wmflabs.org/query/25418
>>>> [2] https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/client/includes/Api/ApiPropsEntityUsage.php
>>>>
>>>> On Tue, Mar 13, 2018 at 5:39 AM, Jaime Crespo <jcre...@wikimedia.org>
>>>> wrote:
>>>>
>>>>> I am not 100% sure there is a perfect way to do what you want by
>>>>> querying the metadata databases (I assume that is what you mean by
>>>>> "query"). I don't think that data is metadata, but content itself,
>>>>> which is not in the metadata databases.
>>>>>
>>>>> Calling the wikidata query service is probably what you want:
>>>>>
>>>>> <https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fgeoname%0AWHERE%20%7B%0A%09%3Fitem%20wdt%3AP1566%20%3Fgeoname%20.%0A%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22%20%7D%0A%7D%0ALIMIT%201000%20OFFSET%201000>
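>>>>> Decoded, that URL corresponds to the following query:
>>>>>
>>>>> SELECT ?item ?itemLabel ?geoname
>>>>> WHERE {
>>>>>   ?item wdt:P1566 ?geoname .
>>>>>   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
>>>>> }
>>>>> LIMIT 1000 OFFSET 1000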
>>>>>
>>>>> Note the LIMIT and OFFSET that will let you iterate over the dataset
>>>>> (a WHERE clause would be faster).
>>>>>
>>>>> Another way to get results is iterating over:
>>>>> <https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Property:P1566&hidetrans=1&hideredirs=1>
>>>>>
>>>>> That is a standard MediaWiki API query. You will also find this in the
>>>>> pagelinks table, but you should check every page you get afterwards (by
>>>>> retrieving its contents), as it could include false positives or be
>>>>> behind on updates.
>>>>>
>>>>> On Sun, Mar 11, 2018 at 3:44 PM, Huji Lee <huji.h...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I need help writing a query that I would like to run on the Clouds.
>>>>>> The goal of the query is to retrieve the following information from
>>>>>> wikidatawiki_p:
>>>>>>
>>>>>> * Find all pages that have a claim for the property P1566, for
>>>>>> example see https://www.wikidata.org/wiki/Q2113430
>>>>>> * Find out what is the value of their P1566 property (in this case,
>>>>>> 18918)
>>>>>>
>>>>>> Output format should be like this:
>>>>>>
>>>>>> page_id       entity            property_value
>>>>>> 2039804      Q2113430     18918
>>>>>> ...
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> Huji
>>>>>>
>>>>>> _______________________________________________
>>>>>> Wikimedia Cloud Services mailing list
>>>>>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
>>>>>> https://lists.wikimedia.org/mailman/listinfo/cloud
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jaime Crespo
>>>>> <http://wikimedia.org>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lucas Werkmeister
>>> Software Developer (Intern)
>>>
>>> Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
>>> Phone: +49 (0)30 219 158 26-0
>>> https://wikimedia.de
>>>
>>> Imagine a world, in which every single human being can freely share in
>>> the sum of all knowledge. That‘s our commitment.
>>>
>>> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>>> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
>>> der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
>>> Körperschaften I Berlin, Steuernummer 27/029/42207.
>>>
>>>
>>
>>
>
>



-- 
Lucas Werkmeister
Software Developer (Intern)

Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de

Imagine a world, in which every single human being can freely share in the
sum of all knowledge. That‘s our commitment.

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.
