Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Jasper Koehorst
We are actually planning to buy a new barebone server; they are around 
€2,500, with barely any memory included. I will check later with sales, but 
16 GB RAM modules are at most around €200 each, so below €10K should be 
sufficient?


> On 1 Nov 2017, at 07:47, Laura Morales  wrote:
> 
>> It's a machine with 378 GiB of RAM and 64 threads running Scientific
>> Linux 7.2, that we use mainly for benchmarks.
>> 
>> Building the index was really all about memory because the CPUs have
>> actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared
>> to those of my regular workstation, which was unable to build it.
> 
> 
> If your regular workstation was using more CPU, I guess it was because of 
> swapping. Thanks for the statistics, it means a "commodity" CPU could handle 
> this fine, the bottleneck is RAM. I wonder how expensive it is to buy a 
> machine like yours... it sounds like in the $30K-$50K range?
> 
> 
>> You're right. The limited query language of hdtSearch is closer to
>> grep than to SPARQL.
>> 
>> Thank you for pointing out Fuseki, I'll have a look at it.
> 
> 
> I think a SPARQL command-line tool could exist, but AFAICT it doesn't exist 
> (yet?). Anyway, I have already successfully setup Fuseki with a HDT backend, 
> although my HDT files are all small. Feel free to drop me an email if you 
> need any help setting up Fuseki.
> 




Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Jasper Koehorst
Hello,

I am currently downloading the latest ttl file, on a machine with 250 GB of 
RAM. I will see if that is sufficient to run the conversion; otherwise we have 
another (albeit busy) machine with around 310 GB.
For querying I use the Jena query engine. I have created a module called 
HDTQuery, located at http://download.systemsbiology.nl/sapp/, which is a simple 
program, still under development, that should be able to use the full power of 
SPARQL and be more advanced than grep… ;)
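For reference, querying an HDT file with full SPARQL through Jena boils down to
wrapping the HDT in a Jena graph. A minimal sketch, assuming the rdfhdt hdt-java
and hdt-jena libraries on the classpath and a local file named wikidata.hdt
(the class and file names are placeholders, not part of HDTQuery):

    // Minimal sketch: SPARQL over a local HDT file via Jena ARQ.
    // Assumes hdt-java + hdt-jena; "wikidata.hdt" is a placeholder.
    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.rdfhdt.hdt.hdt.HDT;
    import org.rdfhdt.hdt.hdt.HDTManager;
    import org.rdfhdt.hdtjena.HDTGraph;

    public class HdtSparqlExample {
        public static void main(String[] args) throws Exception {
            // Memory-maps the HDT; builds the .hdt.index side file if missing.
            HDT hdt = HDTManager.mapIndexedHDT("wikidata.hdt", null);
            Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));

            String sparql = "SELECT * WHERE { ?s ?p ?o } LIMIT 10";
            try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
                ResultSetFormatter.out(qe.execSelect());
            }
            hdt.close();
        }
    }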

If this all works out, I will check with our department whether we can set up 
a weekly cron job to convert the TTL file, if that is still needed. But as the 
dump is growing rapidly, we might run into memory issues later?





> On 1 Nov 2017, at 00:32, Stas Malyshev  wrote:
> 
> Hi!
> 
>> OK. I wonder though, if it would be possible to setup a regular HDT
>> dump alongside the already regular dumps. Looking at the dumps page,
>> https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a
>> new dump is generated once a week more or less. So if a HDT dump
>> could
> 
> True, the dumps run weekly. "More or less" situation can arise only if
> one of the dumps fail (either due to a bug or some sort of external
> force majeure).
> -- 
> Stas Malyshev
> smalys...@wikimedia.org
> 



Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Laura Morales
> It's a machine with 378 GiB of RAM and 64 threads running Scientific
> Linux 7.2, that we use mainly for benchmarks.
> 
> Building the index was really all about memory because the CPUs have
> actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared
> to those of my regular workstation, which was unable to build it.


If your regular workstation was maxing out its CPU, I guess that was because of 
swapping. Thanks for the statistics; they suggest a "commodity" CPU could handle 
this fine and that the bottleneck is RAM. I wonder how expensive it is to buy a 
machine like yours... it sounds like something in the $30K-$50K range?


> You're right. The limited query language of hdtSearch is closer to
> grep than to SPARQL.
> 
> Thank you for pointing out Fuseki, I'll have a look at it.


I think a SPARQL command-line tool could exist, but AFAICT it doesn't exist 
(yet?). Anyway, I have already successfully set up Fuseki with an HDT backend, 
although my HDT files are all small. Feel free to drop me an email if you need 
any help setting up Fuseki.
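In case it saves someone an email, one way to wire this up is with Jena's
embeddable Fuseki server backed by an HDT graph. This is only a rough sketch
under some assumptions (recent Jena with the jena-fuseki-main artifact, plus
hdt-java and hdt-jena; the service name, port and file name are placeholders),
not necessarily the exact setup. I believe hdt-java also ships an assembler
configuration for the standalone Fuseki server, which is the non-embedded way
to do the same thing.

    // Rough sketch: expose an HDT file as a SPARQL endpoint using Jena's
    // embeddable Fuseki server. hdt-java + hdt-jena + jena-fuseki-main are
    // assumed; "/wikidata", the port and the file name are placeholders.
    import org.apache.jena.fuseki.main.FusekiServer;
    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.rdfhdt.hdt.hdt.HDT;
    import org.rdfhdt.hdt.hdt.HDTManager;
    import org.rdfhdt.hdtjena.HDTGraph;

    public class HdtFusekiExample {
        public static void main(String[] args) throws Exception {
            HDT hdt = HDTManager.mapIndexedHDT("wikidata.hdt", null);
            Dataset dataset = DatasetFactory.create(
                    ModelFactory.createModelForGraph(new HDTGraph(hdt)));

            // SPARQL queries can then be sent to http://localhost:3330/wikidata
            FusekiServer.create()
                    .port(3330)
                    .add("/wikidata", dataset)
                    .build()
                    .start();
        }
    }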



Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread sushil dutt
Please take me out from these conversations.

On Wed, Nov 1, 2017 at 5:02 AM, Stas Malyshev 
wrote:

> Hi!
>
> > OK. I wonder though, if it would be possible to setup a regular HDT
> > dump alongside the already regular dumps. Looking at the dumps page,
> > https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a
> > new dump is generated once a week more or less. So if a HDT dump
> > could
>
> True, the dumps run weekly. "More or less" situation can arise only if
> one of the dumps fail (either due to a bug or some sort of external
> force majeure).
> --
> Stas Malyshev
> smalys...@wikimedia.org
>



-- 
Regards,
Sushil Dutt
8800911840


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Stas Malyshev
Hi!

> OK. I wonder though, if it would be possible to setup a regular HDT
> dump alongside the already regular dumps. Looking at the dumps page,
> https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a
> new dump is generated once a week more or less. So if a HDT dump
> could

True, the dumps run weekly. The "more or less" situation can arise only if
one of the dumps fails (either due to a bug or some sort of external
force majeure).
-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Jérémie Roquet
2017-10-31 21:27 GMT+01:00 Laura Morales :
>> I've just loaded the provided hdt file on a big machine (32 GiB wasn't
>> enough to build the index but ten times this is more than enough)
> Could you please share a bit about your setup? Do you have a machine with
> 320GB of RAM?

It's a machine with 378 GiB of RAM and 64 threads running Scientific
Linux 7.2, that we use mainly for benchmarks.

Building the index was really all about memory, because these CPUs actually
have lower per-thread performance (2.30 GHz vs 3.5 GHz) than those of my
regular workstation, which was unable to build it.

> Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd 
> be interested to read your results on this too.

As I'm also looking for up-to-date results, I plan to do it with the latest
Turtle dump as soon as I have a time slot for it; I'll let you know about the
outcome.
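For the record, hdt-java exposes the same conversion programmatically, in case
the rdf2hdt binary gives trouble. A minimal sketch, with the file names and
base URI as placeholders (and, as far as I know, similar memory requirements):

    // Minimal sketch of a Turtle-to-HDT conversion with hdt-java, as an
    // alternative to the rdf2hdt command-line tool. File names and base URI
    // are placeholders; this still needs a lot of RAM for a dump this size.
    import org.rdfhdt.hdt.enums.RDFNotation;
    import org.rdfhdt.hdt.hdt.HDT;
    import org.rdfhdt.hdt.hdt.HDTManager;
    import org.rdfhdt.hdt.options.HDTSpecification;

    public class TurtleToHdt {
        public static void main(String[] args) throws Exception {
            try (HDT hdt = HDTManager.generateHDT(
                    "latest-all.ttl",            // input Turtle dump
                    "http://www.wikidata.org/",  // base URI for relative IRIs
                    RDFNotation.TURTLE,
                    new HDTSpecification(),      // default HDT options
                    null)) {                     // no progress listener
                hdt.saveToHDT("wikidata.hdt", null);
            }
        }
    }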

>> I'll try to run a few queries to see how it behaves.
>
> I don't think there is a command-line tool to parse SPARQL queries, so you 
> probably have to setup a Fuseki endpoint which uses HDT as a data source.

You're right. The limited query language of hdtSearch is closer to
grep than to SPARQL.

Thank you for pointing out Fuseki, I'll have a look at it.

-- 
Jérémie



Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Laura Morales
> I've just loaded the provided hdt file on a big machine (32 GiB wasn't
> enough to build the index but ten times this is more than enough)


Could you please share a bit about your setup? Do you have a machine with 320GB 
of RAM?
Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd 
be interested to read your results on this too.
Thank you!


> I'll try to run a few queries to see how it behaves.


I don't think there is a command-line tool to parse SPARQL queries, so you 
probably have to set up a Fuseki endpoint which uses HDT as a data source.



Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Ghislain ATEMEZING
Hello,
Please don’t get me wrong, and don’t read anything into my question.
Since the beginning of this thread, I have also been trying to push the use of 
HDT here. For example, I was the one who contacted the HDT gurus to fix the 
dataset error on Twitter, and so on...

Sorry if Laura or anyone else thought I was giving “some lessons” here. I don’t 
have a supercomputer either, nor am I a member of the Wikidata team. I'm just a 
“data consumer” like many here...

Best,
Ghislain 

Sent from my iPhone, may include typos

> On 31 Oct 2017, at 20:44, Luigi Assom  wrote:
> 
> Doh what's wrong with asking for supporting own user case "UC" ?
> 
> I think it is a totally legit question to ask, and that's why this thread 
> exists.
> 
> Also, I do support for possibility to help access to data that would be hard 
> to process from "common" hardware. Especially in the case of open data.
> They exists to allow someone take them and build them - amazing if can 
> prototype locally, right?
> 
> I don't like the use case where a data-scientist-or-IT show to the other 
> data-scientist-or-IT own work looking for emotional support or praise.
> I've seen that, not here, and I hope this attitude stays indeed out from 
> here..
> 
> I do like when the work of data-scientist-or-IT ignites someone else's 
> creativity - someone who is completely external - , to say: hey your work is 
> cool and I wanna use it for... my use case!
> That's how ideas go around and help other people build complexity over them, 
> without constructing not necessary borders.
> 
> About a local version of compressed, index RDF - I think that if was 
> available, more people yes probably would use it.
> 
> 
> 
>> On Tue, Oct 31, 2017 at 4:03 PM, Laura Morales  wrote:
>> I feel like you are misrepresenting my request, and possibly trying to 
>> offend me as well.
>> 
>> My "UC" as you call it, is simply that I would like to have a local copy of 
>> wikidata, and query it using SPARQL. Everything that I've tried so far 
>> doesn't seem to work on commodity hardware since the database is so large. 
>> But HDT could work. So I asked if a HDT dump could, please, be added to 
>> other dumps that are periodically generated by wikidata. I also told you 
>> already that *I AM* trying to use the 1 year old dump, but in order to use 
>> the HDT tools I'm told that I *MUST* generate some other index first which 
>> unfortunately I can't generate for the same reasons that I can convert the 
>> Turtle to HDT. So what I was trying to say is, that if wikidata were to add 
>> any HDT dump, this dump should contain both the .hdt file and .hdt.index in 
>> order to be useful. That's about it, and it's not just about me. Anybody who 
>> wants to have a local copy of wikidata could benefit from this, since 
>> setting up a .hdt file seems much easier than a Turtle dump. And I don't 
>> understand why you're trying to blame me for this?
>> 
>> If you are part of the wikidata dev team, I'd greatly appreciate a 
>> "can/can't" or "don't care" response rather than playing the 
>> passive-aggressive game that you displayed in your last email.
>> 
>> 
>> > Let me try to understand ...
>> > You are a "data consumer" with the following needs:
>> >   - Latest version of the data
>> >   - Quick access to the data
>> >   - You don't want to use the current ways to access the data by the 
>> > publisher (endpoint, ttl dumps, LDFragments)
>> >  However, you ask for a binary format (HDT), but you don't have enough 
>> > memory to set up your own environment/endpoint due to lack of memory.
>> > For that reason, you are asking the publisher to support both .hdt and 
>> > .hdt.index files.
>> >
>> > Do you think there are many users with your current UC?
>> 
> 


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Luigi Assom
Doh, what's wrong with asking for support for one's own use case ("UC")?

I think it is a totally legit question to ask, and that's why this thread
exists.

Also, I do support making it easier to access data that would otherwise be
hard to process on "common" hardware, especially in the case of open data.
Open data exists so that someone can take it and build on it - amazing if
you can prototype locally, right?

I don't like the use case where a data scientist or IT person shows their
work to other data scientists or IT people just looking for emotional
support or praise.
I've seen that - not here - and I hope that attitude indeed stays out of
here.

I do like it when the work of a data scientist or IT person ignites someone
else's creativity - someone who is completely external - so they say: hey,
your work is cool and I want to use it for... my use case!
That's how ideas go around and help other people build on them, without
erecting unnecessary borders.

About a local version of compressed, indexed RDF - I think that if it were
available, more people probably would use it.



On Tue, Oct 31, 2017 at 4:03 PM, Laura Morales  wrote:

> I feel like you are misrepresenting my request, and possibly trying to
> offend me as well.
>
> My "UC" as you call it, is simply that I would like to have a local copy
> of wikidata, and query it using SPARQL. Everything that I've tried so far
> doesn't seem to work on commodity hardware since the database is so large.
> But HDT could work. So I asked if a HDT dump could, please, be added to
> other dumps that are periodically generated by wikidata. I also told you
> already that *I AM* trying to use the 1 year old dump, but in order to use
> the HDT tools I'm told that I *MUST* generate some other index first which
> unfortunately I can't generate for the same reasons that I can convert the
> Turtle to HDT. So what I was trying to say is, that if wikidata were to add
> any HDT dump, this dump should contain both the .hdt file and .hdt.index in
> order to be useful. That's about it, and it's not just about me. Anybody
> who wants to have a local copy of wikidata could benefit from this, since
> setting up a .hdt file seems much easier than a Turtle dump. And I don't
> understand why you're trying to blame me for this?
>
> If you are part of the wikidata dev team, I'd greatly appreciate a
> "can/can't" or "don't care" response rather than playing the
> passive-aggressive game that you displayed in your last email.
>
>
> > Let me try to understand ...
> > You are a "data consumer" with the following needs:
> >   - Latest version of the data
> >   - Quick access to the data
> >   - You don't want to use the current ways to access the data by the
> publisher (endpoint, ttl dumps, LDFragments)
> >  However, you ask for a binary format (HDT), but you don't have enough
> memory to set up your own environment/endpoint due to lack of memory.
> > For that reason, you are asking the publisher to support both .hdt and
> .hdt.index files.
> >
> > Do you think there are many users with your current UC?
>


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Jérémie Roquet
2017-10-31 14:56 GMT+01:00 Laura Morales :
> 1. I have downloaded it and I'm trying to use it, but the HDT tools (eg. 
> query) require to build an index before I can use the HDT file. I've tried to 
> create the index, but I ran out of memory again (even though the index is 
> smaller than the .hdt file itself). So any Wikidata dump should contain both 
> the .hdt file and the .hdt.index file unless there is another way to generate 
> the index on commodity hardware

I've just loaded the provided hdt file on a big machine (32 GiB wasn't
enough to build the index, but ten times that is more than enough), so
here are a few interesting metrics:
 - the index alone is ~14 GiB uncompressed, ~9 GiB gzipped and
~6.5 GiB xzipped;
 - once loaded in hdtSearch, Wikidata uses ~36 GiB of virtual memory;
 - right after index generation, that includes ~16 GiB of anonymous
memory (with no memory pressure, that's ~26 GiB resident)…
 - …but after a reload, the index is memory-mapped as well, so it only
includes ~400 MiB of anonymous memory (and a mere ~1.2 GiB resident).

Looks like a good candidate for commodity hardware, indeed. It loads
in less than one second on a 32 GiB machine. I'll try to run a few
queries to see how it behaves.
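To make the memory-mapping point concrete, this is roughly what loading plus a
grep-style triple-pattern lookup looks like through the hdt-java API (a sketch
only; the file name and the example predicate are placeholders, and the C++
hdtSearch tool does the equivalent interactively):

    // Sketch: memory-map an HDT file (generating the .hdt.index side file if
    // it does not exist yet) and run a simple triple-pattern lookup.
    // "wikidata.hdt" and the predicate are placeholders.
    import org.rdfhdt.hdt.hdt.HDT;
    import org.rdfhdt.hdt.hdt.HDTManager;
    import org.rdfhdt.hdt.triples.IteratorTripleString;
    import org.rdfhdt.hdt.triples.TripleString;

    public class HdtPatternSearch {
        public static void main(String[] args) throws Exception {
            // The dictionary and triples are memory-mapped, which is why
            // resident memory stays far below the file size once the index exists.
            try (HDT hdt = HDTManager.mapIndexedHDT("wikidata.hdt", null)) {
                // Empty strings act as wildcards: all triples with this predicate.
                IteratorTripleString it =
                        hdt.search("", "http://schema.org/about", "");
                while (it.hasNext()) {
                    TripleString t = it.next();
                    System.out.println(t.getSubject() + " -> " + t.getObject());
                }
            }
        }
    }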

FWIW, my use case is very similar to yours, as I'd like to run queries
that are too long for the public SPARQL endpoint and can't dedicate a
powerful machine to this full time (Blazegraph runs fine with 32 GiB,
though; it just takes a while to index, and updating is not as fast as
the changes happening on wikidata.org).

-- 
Jérémie



Re: [Wikidata] Wikimedia Blog - Wikidata at Five

2017-10-31 Thread David McDonell
Great article! Thank you, Andrew and Rob!!

On Tue, Oct 31, 2017 at 11:41 AM Andrew Lih  wrote:

> Here’s a piece I wrote with Rob Fernandez for the Wikimedia blog about
> Wikidata at five and Wikidatacon.
>
> https://blog.wikimedia.org/2017/10/30/wikidata-fifth-birthday/
>
> -Andrew
>
-- 
David McDonell
Co-founder & CEO, ICONICLOUD, Inc. - "Illuminating the cloud"
M: 703-864-1203 | EM: da...@iconicloud.com | URL: http://iconicloud.com


[Wikidata] Wikimedia Blog - Wikidata at Five

2017-10-31 Thread Andrew Lih
Here’s a piece I wrote with Rob Fernandez for the Wikimedia blog about
Wikidata at five and Wikidatacon.

https://blog.wikimedia.org/2017/10/30/wikidata-fifth-birthday/

-Andrew


Re: [Wikidata] WDCM: Wikidata usage in Wikivoyage

2017-10-31 Thread Yaroslav Blanter
Thank you Goran.

Cheers
Yaroslav

On Tue, Oct 31, 2017 at 3:28 PM, Goran Milovanovic <
goran.milovanovic_...@wikimedia.de> wrote:

> Hi,
>
> responding to Yaroslav Blanter's following observation on this mailing
> list:
>
> "However, when I look at the statistics of usage,
> http://wdcm.wmflabs.org/WDCM_UsageDashboard/ I see that Wikivoyage
> allegedly uses, in particular, genes, humans (quite a lot, actually), and
> scientific articles. How could this be? I am pretty sure it does not use
> any of these."
>
> Please note that The *Wikidata item usage per semantic category in each
> project type* chart that you have referred to in a later message has a
> logarithmic y-scale (there's a Note explaining this immediately below the
> title of the chart). Also, even from the chart that you were referring to
> you can see that Wikivoyage projects taken together make no use of the
> categories Gene an Scientific Article. The usage of the logarithmic y-axis
> there is a necessity, otherwise we could not offer a comparison across the
> project types (because the differences in usage statistics are huge).
>
> Here's my suggestion on how to obtain a more readable (and more precise)
> information:
>
> - go to the WDCM Usage Dashboard: http://wdcm.wmflabs.org/WDCM_
> UsageDashboard/
> - Tab: Dashboard, and then Tab: Tabs/Crosstabs
> - Enter only: _Wikivoyage in the "Search projects:" field, and select all
> semantic categories in the "Search categories:" field
> - Click "Apply Selection"
>
> What you should be able to learn from the results is that on all
> Wikivoyage projects taken together the total usage of Q5 (Human) is 26, and
> that no items from the Gene (Q7187) or Scientific Article (Q13442814)
> category are used there at all.
>
> Important reminder. The usage statistic in WDCM has the following
> semantics:
>
> - pick an item;
> - count on how many pages in a particular project is that item used;
> - sum up the counts to obtain the usage statistic for that particular item
> in the particular project.
>
> All WDCM Dashboards have a section titled "Description" which provides
> this and similarly important definitions, as well as (hopefully) simple
> descriptions of the respective dashboard's functionality.
>
> Hope this helps.
>
> Best,
> Goran
>
>
>
>
> Goran S. Milovanović, PhD
> Data Analyst, Software Department
> Wikimedia Deutschland
> 
> "It's not the size of the dog in the fight,
> it's the size of the fight in the dog."
> - Mark Twain
> 
>


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Laura Morales
I feel like you are misrepresenting my request, and possibly trying to offend 
me as well.

My "UC", as you call it, is simply that I would like to have a local copy of 
wikidata and query it using SPARQL. Everything that I've tried so far doesn't 
seem to work on commodity hardware since the database is so large. But HDT 
could work. So I asked if an HDT dump could, please, be added to the other 
dumps that are periodically generated by wikidata. I also told you already that 
*I AM* trying to use the 1-year-old dump, but in order to use the HDT tools I'm 
told that I *MUST* first generate an additional index, which unfortunately I 
can't generate for the same reason that I can't convert the Turtle file to HDT. 
So what I was trying to say is that if wikidata were to add an HDT dump, this 
dump should contain both the .hdt file and the .hdt.index in order to be 
useful. That's about it, and it's not just about me. Anybody who wants to have 
a local copy of wikidata could benefit from this, since setting up a .hdt file 
seems much easier than loading a Turtle dump. And I don't understand why you're 
trying to blame me for this?

If you are part of the wikidata dev team, I'd greatly appreciate a "can/can't" 
or "don't care" response rather than playing the passive-aggressive game that 
you displayed in your last email.


> Let me try to understand ... 
> You are a "data consumer" with the following needs:
>   - Latest version of the data
>   - Quick access to the data
>   - You don't want to use the current ways to access the data by the 
> publisher (endpoint, ttl dumps, LDFragments)
>  However, you ask for a binary format (HDT), but you don't have enough memory 
> to set up your own environment/endpoint due to lack of memory.
> For that reason, you are asking the publisher to support both .hdt and 
> .hdt.index files. 
>  
> Do you think there are many users with your current UC?



[Wikidata] WDCM: Wikidata usage in Wikivoyage

2017-10-31 Thread Goran Milovanovic
Hi,

responding to Yaroslav Blanter's following observation on this mailing list:

"However, when I look at the statistics of usage,
http://wdcm.wmflabs.org/WDCM_UsageDashboard/ I see that Wikivoyage
allegedly uses, in particular, genes, humans (quite a lot, actually), and
scientific articles. How could this be? I am pretty sure it does not use
any of these."

Please note that the *Wikidata item usage per semantic category in each
project type* chart that you referred to in a later message has a
logarithmic y-scale (there's a note explaining this immediately below the
title of the chart). Also, even from the chart you were referring to,
you can see that the Wikivoyage projects taken together make no use of the
categories Gene and Scientific Article. The logarithmic y-axis there is a
necessity; otherwise we could not offer a comparison across the project
types (because the differences in usage statistics are huge).

Here's my suggestion on how to obtain a more readable (and more precise)
information:

- go to the WDCM Usage Dashboard:
http://wdcm.wmflabs.org/WDCM_UsageDashboard/
- Tab: Dashboard, and then Tab: Tabs/Crosstabs
- Enter only: _Wikivoyage in the "Search projects:" field, and select all
semantic categories in the "Search categories:" field
- Click "Apply Selection"

What you should be able to learn from the results is that on all Wikivoyage
projects taken together the total usage of Q5 (Human) is 26, and that no
items from the Gene (Q7187) or Scientific Article (Q13442814) category are
used there at all.

Important reminder. The usage statistic in WDCM has the following semantics:

- pick an item;
- count how many pages in a particular project use that item;
- sum up the counts to obtain the usage statistic for that particular item
in that particular project.
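Expressed as code, the aggregation is nothing more than the following sketch
(purely illustrative; the pagesUsingItem lookup is hypothetical and not part of
any WDCM API):

    // Illustrative sketch of the usage statistic described above: for one
    // item, the per-project statistic is the number of pages in that project
    // using the item, and project groups (e.g. all Wikivoyage projects taken
    // together) just sum these. "pagesUsingItem" is hypothetical, not a WDCM API.
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    public class WdcmUsageSketch {
        static int usageAcrossProjects(String item, List<String> projects,
                Map<String, Map<String, Integer>> pagesUsingItem) {
            int total = 0;
            for (String project : projects) {
                // pages in this project that use the item (0 if none)
                total += pagesUsingItem
                        .getOrDefault(project, Collections.emptyMap())
                        .getOrDefault(item, 0);
            }
            return total;
        }
    }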

All WDCM Dashboards have a section titled "Description" which provides this
and similarly important definitions, as well as (hopefully) simple
descriptions of the respective dashboard's functionality.

Hope this helps.

Best,
Goran




Goran S. Milovanović, PhD
Data Analyst, Software Department
Wikimedia Deutschland

"It's not the size of the dog in the fight,
it's the size of the fight in the dog."
- Mark Twain



Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Ghislain ATEMEZING
Interesting use case, Laura! Your UC is rather "special" :)
Let me try to understand ...
You are a "data consumer" with the following needs:
  - Latest version of the data
  - Quick access to the data
  - You don't want to use the ways of accessing the data that the publisher
currently offers (endpoint, ttl dumps, LDFragments)
However, you ask for a binary format (HDT), but you don't have enough
memory to set up your own environment/endpoint.
For that reason, you are asking the publisher to provide both the .hdt and
.hdt.index files.

Do you think there are many users with your current UC?


On Tue, Oct 31, 2017 at 14:56, Laura Morales ()
wrote:

> > @Laura: I suspect Wouter wants to know if he "ignores" the previous
> errors and proposes a rather incomplete dump (just for you) or waits for
> Stas' feedback.
>
>
> OK. I wonder though, if it would be possible to setup a regular HDT dump
> alongside the already regular dumps. Looking at the dumps page,
> https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new
> dump is generated once a week more or less. So if a HDT dump could be added
> to the schedule, it should show up with the next dump and then so forth
> with the future dumps. Right now even the Turtle dump contains the bad
> triples, so adding a HDT file now would not introduce more inconsistencies.
> The problem will be fixed automatically with the future dumps once the
> Turtle is fixed (because the HDT is generated from the .ttl file anyway).
>
>
> > Btw why don't you use the oldest version in HDT website?
>
>
> 1. I have downloaded it and I'm trying to use it, but the HDT tools (eg.
> query) require to build an index before I can use the HDT file. I've tried
> to create the index, but I ran out of memory again (even though the index
> is smaller than the .hdt file itself). So any Wikidata dump should contain
> both the .hdt file and the .hdt.index file unless there is another way to
> generate the index on commodity hardware
>
> 2. because it's 1 year old :)
>
-- 
---
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Laura Morales
> @Laura: I suspect Wouter wants to know if he "ignores" the previous errors 
> and proposes a rather incomplete dump (just for you) or waits for Stas' 
> feedback.


OK. I wonder though, if it would be possible to set up a regular HDT dump 
alongside the existing regular dumps. Looking at the dumps page, 
https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new dump is 
generated once a week, more or less. So if an HDT dump could be added to the 
schedule, it should show up with the next dump and then so forth with the 
future dumps. Right now even the Turtle dump contains the bad triples, so 
adding an HDT file now would not introduce more inconsistencies. The problem 
will be fixed automatically with the future dumps once the Turtle is fixed 
(because the HDT is generated from the .ttl file anyway).


> Btw why don't you use the oldest version in HDT website?


1. I have downloaded it and I'm trying to use it, but the HDT tools (e.g. for 
querying) require building an index before I can use the HDT file. I've tried 
to create the index, but I ran out of memory again (even though the index is 
smaller than the .hdt file itself). So any Wikidata dump should contain both 
the .hdt file and the .hdt.index file, unless there is another way to generate 
the index on commodity hardware.

2. Because it's 1 year old :)



Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Ghislain ATEMEZING
@Laura: I suspect Wouter wants to know whether he "ignores" the previous errors
and proposes a rather incomplete dump (just for you), or waits for Stas'
feedback.
Btw, why don't you use the old version on the HDT website?

On Tue, Oct 31, 2017 at 7:53, Laura Morales ()
wrote:

> @Wouter
>
> > Thanks for the pointer!  I'm downloading from
> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now.
>
> Any luck so far?
>
-- 
---
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org