Re: [Wikidata] Let's move forward with support for Wiktionary

2016-09-13 Thread Amirouche Boubekki

Hello,

I am very happy about this news.

I am a wiki newbie interested in using Wikidata to do text analysis.
I am trying to follow the discussion here and on the French Wiktionary.

I take this as an opportunity to sum up some of the concerns that have 
been raised on the French Wiktionary [0]:


- How will the Wikidata and Wiktionary databases be synchronized?

- Will editing Wiktionary change? The concern is that this will make 
editing Wiktionary more difficult for people.


- Also, what about bots? Will bots be allowed/able to edit Wiktionary 
pages once Wikidata support lands in Wiktionary?


- Another concern is about edits made on one Wiktionary that have an 
impact on another Wiktionary. People will have trouble reconciling their 
opinions, given that they don't speak the same language. Can an 
edit on Wiktionary A break Wiktionary B?


I understand that Wikidata requires new code to support the organisation 
of new relations between the data. I understand that with Wikidata it 
will be easy to create interwiki links and thesaurus-like pages, but 
what else can Wikidata provide to Wiktionary?


[0] https://fr.wiktionary.org/wiki/Projet:Coop%C3%A9ration/Wikidata

Thanks,

i⋅am⋅amz3

On 2016-09-13 15:17, Lydia Pintscher wrote:

Hey everyone :)

Wiktionary is our third-largest sister project, both in terms of active
editors and readers. It is a unique resource, with the goal to provide
a dictionary for every language, in every language. Since the
beginning of Wikidata but increasingly over the past months I have
been getting more and more requests for supporting Wiktionary and
lexicographical data in Wikidata. Having this data available openly
and freely licensed would be a major step forward in automated
translation, text analysis, text generation and much more. It will
enable and ease research. And most importantly it will enable the
individual Wiktionary communities to work more closely together and
benefit from each other’s work.

With this and the increased demand to support Wikimedia Commons with
Wikidata, we have looked at the bigger picture and our options. I am
seeing a lot of overlap in the work we need to do to support
Wiktionary and Commons. I am also seeing increasing pressure to store
lexicographical data in existing items (which would be bad for many
reasons).

Because of this we will start implementing support for Wiktionary in
parallel to Commons based on our annual plan and quarterly plans. We
contacted several of our partners in order to get funding for this
additional work. I am happy that Google agreed to provide funding
(restricted to work on Wikidata). With this we can reorganize our team
and set up one part of the team to continue working on building out
the core of Wikidata and support for Wikipedia and Commons and the
other part will concentrate on Wiktionary. (To support and to extend
our work around Wikidata with the help of external funding sources was
our plan in our annual plan 2016:
https://meta.wikimedia.org/wiki/Grants:APG/Proposals/2015-2016_round1/Wikimedia_Deutschland_e.V./Proposal_form#Financials:_current_funding_period)

As a next step I’d like us all to have another careful look at the
latest proposal at
https://www.wikidata.org/wiki/Wikidata:Wiktionary/Development. It has
been online for input in its current form for a year and the first
version is 3 years old now. So I am confident that the proposal is in
a good shape to start implementation. However I’d like to do a last
round of feedback with you all to make sure the concept really is
sane. To make it easier to understand there is now also a pdf
explaining the concept in a slightly different way:
https://commons.wikimedia.org/wiki/File:Wikidata_for_Wiktionary_announcement.pdf
Please do go ahead and review it. If you have comments or questions
please leave them on the talk page of the latest proposal at
https://www.wikidata.org/wiki/Wikidata_talk:Wiktionary/Development/Proposals/2015-05.
I’d be especially interested in feedback from editors who are familiar
with both Wiktionary and Wikidata.

Getting support for Wiktionary done - just like for Commons - will
take some time but I am really excited about the opportunities it will
open up especially for languages that have so far not gotten much or
any technological support.


Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


--
Amirouche ~ amz3 ~ http://www.hyperdev.fr


[Wikidata] Example tasks to solve using a Python graphdb

2016-11-25 Thread Amirouche Boubekki

Hello,


I am developing a graph database using Python. The point is
to have an easy-to-set-up database that can handle bigger-than-RAM
datasets like Wikidata. Right now there is no such database
available except in the Java world via Neo4j embedded.

AjguDB [0] is a graphdb library I've been working on for some
time that allows one to store graph data on disk and query it
using a query language similar to TinkerPop's Gremlin. There is
NO SPARQL query engine, yet.
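
To give a concrete idea of the kind of task I have in mind, here is a tiny
self-contained sketch expressed over an in-memory edge list rather than
AjguDB itself (the data and property names below are made up for
illustration; the real work would run over a Wikidata dump loaded into the
library):

    # A toy example task: given edges stored as (subject, predicate, object)
    # tuples, find every item that is an instance of "film" together with its
    # director. The identifiers are invented for the example.
    EDGES = [
        ("Q1", "instance-of", "film"),
        ("Q1", "director", "Q9"),
        ("Q2", "instance-of", "book"),
        ("Q3", "instance-of", "film"),
        ("Q3", "director", "Q7"),
    ]

    def outgoing(subject, predicate):
        """Yield objects reachable from `subject` via `predicate`."""
        for s, p, o in EDGES:
            if s == subject and p == predicate:
                yield o

    # Traversal: start from every film, then hop along the "director" edge.
    films = [s for s, p, o in EDGES if p == "instance-of" and o == "film"]
    for film in films:
        for director in outgoing(film, "director"):
            print(film, "is directed by", director)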

I am looking for tasks to solve using Wikidata that can help
demonstrate the use of the library. In other words, what do
you use Wikidata for with your favorite tool? I'd like to
replicate that work using AjguDB and see whether it's up to
the task.

Thanks in advance!

PS: I have written a similar library in Scheme.

[0] https://github.com/amirouche/AjguDB

--
Amirouche ~ amz3 ~ http://www.hyperdev.fr

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] I'm calling it. We made it ;-)

2016-12-31 Thread Amirouche Boubekki

On 2016-12-31 11:57, Lydia Pintscher wrote:

Folks,

We're now officially mainstream ;-)
https://www.buzzfeed.com/katiehasty/song-ends-melody-lingers-in-2016?utm_term=.nszJxrKqR#.sknE4nVAg



Neat!



Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


--
Amirouche ~ amz3 ~ http://www.hyperdev.fr

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Solve legal uncertainty of Wikidata

2018-05-18 Thread Amirouche Boubekki

What, Wikidata doesn't track the license of each piece of information?!

--
Amirouche ~ amz3 ~ http://www.hyperdev.fr

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Wikidata as software metadata repository

2018-12-19 Thread Amirouche Boubekki
Hello,

I am investigating this with several other people in the GNU project, as
part of Guix [0].

Our goal is to make our packages easier to discover by our users via
full-text search or structured queries.

Questions:

a) I see that Arch and Debian have properties. What would it take to have a
Guix property?

b) Is there already a group of people working together to put in place a
list of requirements
for software entities to be considered good in the sense of Wikidata?

c) What level of notability is required for a piece of software to be included in Wikidata?

Thanks in advance!

[0] http://gnu.org/s/guix is both a package manager and an operating system
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-07 Thread Amirouche Boubekki
On Thu, Jun 6, 2019 at 21:33, Guillaume Lederrey wrote:

> Hello all!
>
> There has been a number of concerns raised about the performance and
> scaling of Wikdata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>

I will add that, in an ideal world, setting up Wikidata (i.e. the interface
that allows edits, the entity search service, and WDQS) should also be easy.

Wikidata tools should be (more) accessible.


> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore.


> Reasonably, addressing all of the above constraints is unlikely to
> ever happen.


never say never ;-)


> For example, the update process is asynchronous. It is by nature
> expected to lag. In the best case, this lag is measured in minutes,
> but can climb to hours occasionally. This is a case of prioritizing
> stability and correctness (ingesting all edits) over update latency.
> And while we can work to reduce the maximum latency, this will still
> be an asynchronous process and needs to be considered as such.
>


> We currently have one Blazegraph expert working with us to address a
> number of performance and stability issues. We
> are planning to hire an additional engineer to help us support the
> service in the long term. You can follow our current work in phabricator
> [2].
>
> If anyone has experience with scaling large graph databases, please
> reach out to us, we're always happy to share ideas!
>

Good luck!


> Thanks all for your patience!
>
>Guillaume
>
> [1]
> https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy


Here is my point of view regarding some discussions happening on the talk
page:

> Giving up on SPARQL.

There is an ongoing effort to draft version 1.2 of SPARQL. It is the right
time to give some feedback.

Also, look at https://github.com/w3c/EasierRDF/

> JanusGraph  (successor of Titan, now part
DataStax) - Written in java, using scalable data-storage (cassandra/hbase)
and indexing engines (ElasticSearch/SolR), queryable

That would make Wikidata much less accessible, even if JanusGraph has an
Oracle Berkeley DB backend. The full-text search and geospatial indices live
in yet another process.

> I can't think of any other way than transforming the wikidata RDF
representation to a more suitable one for graph-properties engines

FWIW, OpenCog's AtomSpace has a Neo4j backend but they do not use it.

Also, property-graph engines are slow at representing things like:

("wikidata", "used-by", "opencog")
("wikidata", "used-by", "google")

That is, one has to create a hyper-edge if you want to be able to query
those facts.
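
To make the point concrete, here is a small sketch (plain Python, made-up
data, not any particular engine's API) of the difference: with bare edges a
fact is just a triple, but as soon as you want to attach qualifiers to the
fact or query the statement itself, the statement has to be promoted to a
node of its own (a hyper-edge / reification):

    # Bare edges: a fact is a plain triple, cheap to store and to scan.
    edges = [
        ("wikidata", "used-by", "opencog"),
        ("wikidata", "used-by", "google"),
    ]

    # Property-graph style: to qualify or reference the fact itself (when it
    # was added, with what source, ...), the statement becomes a node with its
    # own id, and the original subject/object hang off that node.
    statements = {
        "stmt-1": {"subject": "wikidata", "predicate": "used-by",
                   "object": "opencog", "since": 2015},
        "stmt-2": {"subject": "wikidata", "predicate": "used-by",
                   "object": "google", "since": 2012},
    }

    # Querying "who uses wikidata?" now means scanning statement nodes instead
    # of following a single edge, which is the extra indirection mentioned above.
    users = [s["object"] for s in statements.values()
             if s["subject"] == "wikidata" and s["predicate"] == "used-by"]
    print(users)  # ['opencog', 'google']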


> [2] https://phabricator.wikimedia.org/project/view/1239/



Best regards,


Amirouche ~ amz3
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] searching for Wikidata items

2019-06-07 Thread Amirouche Boubekki
Hello all,


On Tue, Jun 4, 2019 at 15:46, Marielle Volz wrote:

> Yes, the api is at
> https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=Bush
>
> There's a sandbox where you can play with the various options:
>
> https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=Bush
>
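
For reference, a minimal sketch of calling that search API from Python with
the `requests` library (standard MediaWiki action API parameters; error
handling omitted):

    import requests

    # Query the MediaWiki search API on wikidata.org for items matching "Bush".
    params = {
        "action": "query",
        "list": "search",
        "srsearch": "Bush",
        "format": "json",
    }
    response = requests.get("https://www.wikidata.org/w/api.php", params=params)
    for hit in response.json()["query"]["search"]:
        # Each hit carries, among other fields, the page title (the Q-id for items).
        print(hit["title"])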


Can anyone point me to the relevant code that supports the search feature?
Or explain to me how it is done?


Thanks in advance!


On Tue, Jun 4, 2019 at 2:22 PM Tim Finin  wrote:
>
>> What's the best way to search Wikidata for items whose name or alias
>> matches a string?  The search available via pywikibot seems to only find a
>> match if the search string is a prefix of an item's name or alias, so
>> searching for "Bush" does not return any of the George Bush items.  I
>> don't want to use a SPARQL query with a regex, since I expect that to be
>> slow.
>>
>> The search box on the Wikidata pages is closer to what I want.  Is there
>> a good way to call this via an API?
>>
>> Ideally, I'd like to be able to specify a language and also a set of
>> types, but I can do that once I've identified candidates based on a simple
>> match with a query string.
>>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-12 Thread Amirouche Boubekki
On Sun, Jun 9, 2019 at 23:18, Amirouche Boubekki <
amirouche.boube...@gmail.com> wrote:

> I made a proposal for a grant at
> https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB
>
> Mind the fact that this is not about the versioned quadstore. It is about
> simple triplestore, it mainly missing bindings for foundationdb and SPARQL
> syntax.
>
> Also, I will prolly need help to interface with geo and label services.
>
> Feedback welcome!
>

I got "feedback" in others threads from the same topic that I will quote
and reply to.

> So there needs to be some smarter solution, one that we'd unlike to
develop inhouse

Big cat, small fish. As Wikidata continues to grow, it will have specific
needs.
Needs that are unlikely to be solved by off-the-shelf solutions.

> but one that has already been verified by industry experience and other
deployments.

FoundationDB and WiredTiger are used respectively at Apple (among other
companies)
and in MongoDB since 3.2, all over the world. WiredTiger is also used at Amazon.

> We also have a plan on improving the throughput of Blazegraph, which
we're working on now.

What is the Phabricator ticket, please?

> "Evaluation of Metadata Representations in RDF stores"

I don't understand how this is related to the scaling issues.

> [About proprietary version Virtuoso], I dare say [it must have] enormous
advantage for us to consider running it in production.

That will be vendor lock-in for Wikidata and Wikimedia, along with all the poor
souls that try to interoperate with it.

> This project seems to be still very young.

First commit
<https://github.com/arangodb/arangodb/commit/6577d5417a000c29c9ee7666cbcc3cefae6eee21>
is from 2011.

> ArangoDB seems to be a document database inside.

It has two backends: MMAP and RocksDB.

> While I would be very interested if somebody took on themselves to model
Wikidata
> in terms of ArangoDB documents,

It looks like a bounty.

ArangoDB is a multi-model database; it supports:

- Document
- Graph
- Key-Value

> load the whole data and see what the resulting performance would be, I am
not sure
> it would be wise for us to invest our team's - very limited currently -
resources into that.

I am biased. I would advise against trying ArangoDB. It is another short-term
solution.

> the concept of having single data store is probably not realistic at
least
> within foreseeable timeframes.

Incorrect. My solution is within the foreseeable timeframe.

> We use separate data store for search (ElasticSearch) and probably will
> have to have separate one for queries, whatever would be the mechanism.

It would be interesting to read how many "resources" are poured into keeping
all of those synchronized:

- ElasticSearch
- MySQL
- BlazeGraph

Maybe some Redis?
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-16 Thread Amirouche Boubekki
Hello Sebastian and Stas,

On Wed, Jun 12, 2019 at 19:27, Amirouche Boubekki <
amirouche.boube...@gmail.com> wrote:

> Hello Sebastian,
>
> First thanks a lot for the reply. I started to believe that what I was
> saying was complete nonsense.
>
> On Wed, Jun 12, 2019 at 16:51, Sebastian Hellmann <
> hellm...@informatik.uni-leipzig.de> wrote:
>
>> Hi Amirouche,
>>
>> Any open data projects that are running open databases with FoundationDB
>> and WiredTiger? Where can I query them?
>>
>
> Thanks for asking. I will set up a wiredtiger instance of wikidata. I need
> a few days, maybe a week (or two :)).
>
> I could setup FoundationDB on a single machine instead but it will require
> more time (maybe one more week).
>
> Also, it will not support geo-queries. I will try to make labelling work
> but with a custom syntax (inspired form SPARQL).
>

I figured that anything that is not SPARQL will not be convincing. Getting
my engine 100% compatible is a lot of work.

The example deployment I have given in the previous message should be
enough to convince you that
FoundationDB can store WDQS's data.

The documented limits of FDB state that it supports up to 100 TB of data
<https://apple.github.io/foundationdb/known-limitations.html#database-size>.
That is 100 times more
than what WDQS needs at the moment.

Anyway, I updated my proposal to help the Wikimedia Foundation transition
to a new solution on the wiki
<https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB> to
reflect the new requirements, where the required space was reduced from 12 TB
of SSD to 6 TB of SSD; this is based on this FDB forum topic
<https://forums.foundationdb.org/t/sizing-and-pricing/379/2?u=amirouche>
and an optimisation I will make in my engine. That proposal is biased
toward getting an FDB prototype. It could be
reworked to emphasize the fact that a benchmarking tool must be put
together to be able to tell which solution is best.

My estimations might be off, especially the 1 month of GCP credits.

To be honest, WDQS is low-hanging fruit compared to the goal of building
a portable Wikidata.

I am offering my full-time services; it is up to you to decide what will
happen.
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] About Functional QuadStore for WikiData (was Are we ready for our future)

2019-06-16 Thread Amirouche Boubekki
I created another draft proposal to create a *prototype* to scale Wikidata,
using the tools I have been building, that goes beyond only scaling the
Wikidata Query Service. The first quarter should be reserved for WDQS.

As you might have seen, the first proposal is 6 months long, and in this
proposal WDQS should be replaced in 4 months. I take that into account in the
last quarter, which is supposed to be reserved for bug fixing.

https://meta.wikimedia.org/wiki/Grants:Project/Iamamz3/Prototype_A_Scalable_WikiData

Feedback welcome!

On Thu, May 16, 2019 at 19:03, Thad Guidry wrote:

> Yes, Freebase search supported "as_of_time" in its MQL syntax...(it didn't
> however during its first 5 months of life if I recall using however, but
> was added later and loved by the community for helping with abuse
> mitigation)
> https://developers.google.com/freebase/v1/search
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-09 Thread Amirouche Boubekki
I made a proposal for a grant at
https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB

Mind the fact that this is not about the versioned quadstore. It is about a
simple triplestore; it is mainly missing bindings for FoundationDB and SPARQL
syntax.

Also, I will probably need help to interface with the geo and label services.

Feedback welcome!
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-12 Thread Amirouche Boubekki
Hello Sebastian,

First, thanks a lot for the reply. I was starting to believe that what I was
saying was complete nonsense.

On Wed, Jun 12, 2019 at 16:51, Sebastian Hellmann <
hellm...@informatik.uni-leipzig.de> wrote:

> Hi Amirouche,
> On 12.06.19 14:07, Amirouche Boubekki wrote:
>
> > So there needs to be some smarter solution, one that we'd unlike to
> develop inhouse
>
> Big cat, small fish. As wikidata continue to grow, it will have specific
> needs.
> Needs that are unlikely to be solved by off-the-shelf solutions.
>
>
> Are you suggesting to develop the database in-house?
>
Yes! At least part of it: the domain-specific part.

> even MediaWiki uses MySQL
>
Yes, but it is not because of its technical merits. Similarly for PHP.
Historically, PHP and MySQL were easy to set up and easy to use
but otherwise difficult to work with. This is/was painful enough that
nowadays the go-to RDBMS is PostgreSQL, even if MySQL is still
very popular [0][1]. Those are technical reasons. Also, I agree that the fact
that MySQL had no ACID guarantees when it started in 1995 does not make it a
bad choice nowadays.

[0] https://trends.google.com/trends/explore?q=MySQL,PostgreSQL
[1] https://stackshare.io/stackups/mysql-vs-postgresql

> but one that has already been verified by industry experience and other
> deployments.
>
> FoundationDB and WiredTiger are respectively used at Apple (among other
> companies)
> and MongoDB since 3.2 all over-the-world. WiredTiger is also used at
> Amazon.
>
> Let`s not talk about MongoDB, it is irrelevant and very mixed.
>

I am giving an example deployment of WiredTiger. WiredTiger is an ordered
key-value store that has been the storage engine of MongoDB since 3.2. It was
created by an independent company, and MongoDB later acquired WiredTiger. It is
still GPLv2 or v3. Among the founders is one of the engineers who created
bsddb (Berkeley DB), which Oracle bought. Also, I am not saying WiredTiger
solves all the problems of MongoDB. I am just saying that because WiredTiger
has been the storage backend of MongoDB since 3.2, it has seen widespread
usage and testing.

Some say it is THE solution for scalability, others have said it was the
> biggest disappointment.
>

Some people gave warnings about the technical issues of MongoDB before 3.2.
Also, caveat emptor. The situation is better than a few years back. After all,
it has been open source / free software / source-available software since the
beginning.

Like I said above, WiredTiger is not the solution to all problems. I cited
WiredTiger as a possible tool for building a cluster similar to the current
one, where machines have full copies of the data. The advantage of WiredTiger
is that it is easier to set up (compared to a distributed database), but it
still requires fine-tuning / configuration. Also, there are many other ordered
key-value stores in the wild. I have documented those in this document:

https://github.com/scheme-requests-for-implementation/srfi-167/blob/master/libraries.md
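
To illustrate the general technique (a rough sketch in plain Python, not the
actual WiredTiger API nor my engine's code): an ordered key-value store lets
you lay out each triple under several key orderings, after which any triple
pattern with at least one bound term becomes a single prefix scan.

    import bisect

    # Rough sketch: store each triple under three orderings (spo, pos, osp) as
    # tuple keys in one sorted list. Real engines do the same thing with
    # WiredTiger / LMDB / RocksDB / FoundationDB key ranges.
    keys = []

    def add(s, p, o):
        for key in (("spo", s, p, o), ("pos", p, o, s), ("osp", o, s, p)):
            bisect.insort(keys, key)

    def scan(prefix):
        """Yield every stored key starting with `prefix` (tuple prefix scan)."""
        i = bisect.bisect_left(keys, prefix)
        while i < len(keys) and keys[i][:len(prefix)] == prefix:
            yield keys[i]
            i += 1

    add("Q31", "label", "Belgium")
    add("Q31", "instance-of", "country")
    add("Q142", "label", "France")

    # Pattern (?s, "label", ?o): only the predicate is bound, so scan "pos".
    for _, p, o, s in scan(("pos", "label")):
        print(s, "has label", o)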

In particular, if WDQS doesn't want to use ACID transactions, there might be a
better solution. Other popular options are LMDB (used in OpenLDAP) and RocksDB
by Facebook (which is a LevelDB fork). But again, that is ONE possibility; my
design / database works with any of the libraries described in the above
libraries.md URL.

My recommendation for a production cluster is to use FoundationDB,
because it can scale horizontally and provides single / double / triple
replication. If a node is down, writes and reads can still continue if you
have enough machines up.

WiredTiger would be better suited for a single machine (and my database (can)
support both WiredTiger and FoundationDB with the same code base).

Do FoundationDB and WiredTiger have any track record for hosting open data
> projects or being chosen by open data projects?
>
tl;dr: I don't know.

Like I said previously, WiredTiger is used in many contexts; among others, it
is used at Amazon Web Services (AWS).

FoundationDB is used at Apple; I don't remember which services rely on it, but
at least the Data Science team relies on it. The main contributor did a
lightning talk about it:

   Entity Store: A FoundationDB Layer for Versioned Entities with Fine
Grained <https://youtu.be/16uU_Aaxp9Y>

That is the use case that looks the most like Wikidata's.

More on the popularity contest: it is used at Wavefront (owned by VMware),
which is an analytics tool.
Here is a talk:

  Running FDB at scale <https://youtu.be/M438R4SlTFE>

JanusGraph has an FDB backend; see the talk:

  The JanusGraph FoundationDB Storage Adapter <https://youtu.be/rQM_ZPZy8Ck>

It is also used at Snowflake <https://www.snowflake.com/>, which is
apparently a data warehouse; here is the talk:

   How FoundationDB powers SnowflakeDB's metadata
<https://youtu.be/KkeyjFMmIf8>

It is also used at SkuVault as a multi-model database; see the forum topic:


https://forums.foundationdb.org/t/success-stor

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-12 Thread Amirouche Boubekki
On Wed, Jun 12, 2019 at 19:11, Stas Malyshev wrote:

> Hi!
>
> >> So there needs to be some smarter solution, one that we'd unlike to
> > develop inhouse
> >
> > Big cat, small fish. As wikidata continue to grow, it will have specific
> > needs.
> > Needs that are unlikely to be solved by off-the-shelf solutions.
>
> Here I think it's good place to remind that we're not Google, and
> developing a new database engine inhouse is probably a bit beyond our
> resources and budgets.


Today, the problem is not the same as the one MySQL, PostgreSQL, Blazegraph
and OpenLink had when they started working on their respective databases.
See below.


> Fitting existing solution to our goals - sure, but developing something
> new of that scale is probably not going to happen.

It will.

> FoundationDB and WiredTiger are respectively used at Apple (among other
> > companies)
> > and MongoDB since 3.2 all over-the-world. WiredTiger is also used at
> Amazon.
>
> I believe they are, but I think for our particular goals we have to
> limit themselves for a set of solution that are a proven good match for
> our case.
>

See the other mail I just sent. We are at a turning point in database
engineering history. The latest database systems that were built are all
based on ordered key-value stores; see the Google Spanner paper [0].

Thanks to WT/MongoDB and Apple, those are readily available, in widespread
use and fully open source. Only a few pieces are missing to make it work in a
fully backward-compatible way with WDQS (at scale).

[0] https://ai.google/research/pubs/pub39966


> > That will be vendor lock-in for wikidata and wikimedia along all the
> > poor souls that try to interop with it.
>
> Since Virtuoso is using standard SPARQL, it won't be too much of a
> vendor lock in, though of course the standard does not cover all, so
> some corners are different in all SPARQL engines.


There is a big chance that the same thing that happened with the WWW will
happen with RDF: that is, one big player owning all the implementations.


> This is why even migration between SPARQL engines, even excluding
> operational aspects, is non-trivial.


I agree.


> Of course, migration to any non-SPARQL engine would be order of magnitude
> more disruptive, so right now we do not seriously consider doing that.

I also agree.

>
> As I already mentioned, there's a difference between "you can do it" and
> "you can do it efficiently". [...] The tricky part starts when you need to
> run millions
> of queries on 10B triples database. If your backend is not optimal for
> that task, it's not going to perform.
>

I already did small benchmarks against Blazegraph. I will do more intensive
benchmarks using Wikidata (and reduce the requirements in terms of SSD).
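
For what it's worth, the kind of micro-benchmark I run is nothing fancy; here
is a hedged sketch of the harness (the endpoint URL and query are placeholders
to point at whichever SPARQL service is under test):

    import statistics
    import time

    import requests

    # Placeholder endpoint and query: both values are assumptions for the sketch.
    ENDPOINT = "http://localhost:9999/sparql"
    QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 100"

    def run_once():
        """Time a single HTTP round-trip for the query, in seconds."""
        start = time.perf_counter()
        response = requests.get(
            ENDPOINT,
            params={"query": QUERY},
            headers={"Accept": "application/sparql-results+json"},
        )
        response.raise_for_status()
        return time.perf_counter() - start

    # Warm up once, then time a handful of runs and report the median latency.
    run_once()
    timings = [run_once() for _ in range(10)]
    print("median latency:", statistics.median(timings), "seconds")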


Thanks for the reply.
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] About Functional QuadStore for WikiData (was Are we ready for our future)

2019-05-16 Thread Amirouche Boubekki
Hi Joshua,

Thanks for your input.

On Thu, May 16, 2019 at 17:02, Joshua Shinavier wrote:

> Hi Amirouche,
>



> The version history and time-travel features sound a lot like the
> "integrated versioning system" of Freebase, circa 2009 when they (Metaweb)
> presented at WWW.
>

Reading through [0], it seems Freebase only allowed undo, whereas in datae
it will be possible to query the full history and undo commits.

[0] https://www.aaai.org/Papers/AAAI/2007/AAAI07-355.pdf


> As Freebase's data was transferred to Wikidata, this sounds a little
> circular; I wonder what advantages datae would offer vis-a-vis Freebase.
>

Freebase data was transferred to Wikidata. What I am looking for is
replacing Wikibase + Blazegraph, so that both editing and querying would happen
against the same database, making it much easier to set up and maintain.


> Disclaimer: this is coming from a Wikidata lurker who just happens to like
> the Freebase approach to versioning of knowledge graphs, similar to what
> you have described.
>
> Josh
>
>
> On Sat, May 4, 2019 at 5:49 AM Amirouche Boubekki <
> amirouche.boube...@gmail.com> wrote:
>
>> On Sat, May 4, 2019 at 04:00, Yuri Astrakhan wrote:
>>
>>> Sounds interesting, is there a github repo?
>>>
>>
>> Thanks for your interest. This is still a work-in-progress. I made a
>> prototype that demos
>> that the history significance measure allows to do time travelling
>> queries in the v0 branch.
>> Right now, I am working on getting it all together.
>>
>>
>> https://github.com/awesome-data-distribution/datae/tree/master/docs/SCHEME20XY#abstract
>>
>> On Fri, May 3, 2019 at 8:19 PM Amirouche Boubekki <
>>> amirouche.boube...@gmail.com> wrote:
>>>
>>>> GerardM post triggered my interest to post to the mailing list. As you
>>>> might know I am working on functional quadstore that is quadstore that
>>>> keeps around old version of data, like a wiki but in direct-acyclic-graph.
>>>> It only stores differences between commits. It rely on snapshot of the
>>>> latest version for fast reads. My ultimate goal is to build somekind of
>>>> portable knowlege base. That is something like WikiBase + blazegraph but
>>>> that you spinup on regular machine with the press of button.
>>>>
>>>> Enought brag about me. I wont't reply to all the message of the threads
>>>> one by one but:
>>>>
>>>> Here is what SHOULD BE possible:
>>>>
>>>> - incremental dumps
>>>> - time traveling queries
>>>> - full dumps
>>>> - The federation of wikibase SHOULD BE possible since it stored in a
>>>> history like GIT and git pull git push are planned in the ROADMAP
>>>>
>>>> And online edition of the quadstore.
>>>>
>>>> Access Control List are not designed yet, I except that this should be
>>>> enforced by the application layer.
>>>>
>>>> I planned start working on Data Management System (something like CKAN)
>>>> with search featrure. But I would gadly work with wikimedia instead.
>>>>
>>>> Also, given it modeled after git, one can do merge-request like
>>>> features, ie. exit the massive import that is crippled.
>>>>
>>>> What I would need is logs possibly with timing of queries (read and
>>>> write) to do benchmarks.
>>>>
>>>
>> Request:
>>
>>- Is it possible to have logs of read queries done against blazegraph
>>with timings?
>>- Is it possible to have logs of write queries done against mysql
>>with timings?
>>
>> In the best of the worlds, it would be best to have logs to replicate the
>> workload of the producion databases.
>>
>>
>>> Maybe I should ask for fund at mediawiki?
>>>>
>>>
>> What about this? Any possibility to have my project funded by the
>> foundation somehow?
>>
>>
>>>
>>>> FWIW, I got 2 times faster than blazegraph on microbenchmark.
>>>>
>>>>
>>>>> Hoi,
>>>>> Wikidata grows like mad. This is something we all experience in the
>>>>> really bad response times we are suffering. It is so bad that people are
>>>>> asked what kind of updates they are running because it makes a difference
>>>>> in the lag times there are.
>>>>>
>>>>> Given that Wikidata is growing like a weed, it follows that there are
>>>>

[Wikidata] About Functional QuadStore for WikiData (was Are we ready for our future)

2019-05-03 Thread Amirouche Boubekki
GerardM's post triggered my interest to post to the mailing list. As you
might know, I am working on a functional quadstore, that is, a quadstore that
keeps old versions of the data around, like a wiki but in a directed acyclic
graph. It only stores differences between commits. It relies on a snapshot of
the latest version for fast reads. My ultimate goal is to build some kind of
portable knowledge base, that is, something like Wikibase + Blazegraph but
that you can spin up on a regular machine with the press of a button.
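
To give an idea of what I mean, here is a minimal sketch under my own
assumptions (not the actual engine, which stores diffs in a DAG of commits; a
linear history is enough to show the idea): each quad is tagged with the
commit that added it and, possibly, the commit that removed it, so a
time-travelling read just filters on the commit history.

    # Minimal sketch of time-travelling reads over versioned quads.
    HISTORY = ["c0", "c1", "c2"]  # oldest to newest

    QUADS = [
        # (graph, subject, predicate, object, added_in, removed_in)
        ("wd", "Q31", "label", "Belgique", "c0", "c1"),
        ("wd", "Q31", "label", "Belgium", "c1", None),
        ("wd", "Q31", "instance-of", "country", "c0", None),
    ]

    def as_of(commit):
        """Yield the quads visible at the given commit."""
        visible = set(HISTORY[: HISTORY.index(commit) + 1])
        for g, s, p, o, added, removed in QUADS:
            if added in visible and (removed is None or removed not in visible):
                yield (g, s, p, o)

    print(list(as_of("c0")))  # old label "Belgique" is visible
    print(list(as_of("c2")))  # latest state: label "Belgium"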

Enough bragging about me. I won't reply to all the messages of the threads one
by one, but:

Here is what SHOULD BE possible:

- incremental dumps
- time-travelling queries
- full dumps
- Federation of Wikibase SHOULD BE possible, since it is stored in a
git-like history, and git pull / git push are planned in the ROADMAP

And online editing of the quadstore.

Access Control Lists are not designed yet; I expect that this should be
enforced by the application layer.

I planned to start working on a Data Management System (something like CKAN)
with a search feature, but I would gladly work with Wikimedia instead.

Also, given it is modeled after git, one can have merge-request-like features,
i.e. a way out of the massive imports that are crippling the service.

What I would need is logs, possibly with timings of queries (read and write),
to do benchmarks.

Maybe I should ask for funding at MediaWiki?

FWIW, I got 2 times faster than Blazegraph on a microbenchmark.


> Hoi,
> Wikidata grows like mad. This is something we all experience in the really
> bad response times we are suffering. It is so bad that people are asked
> what kind of updates they are running because it makes a difference in the
> lag times there are.
>
> Given that Wikidata is growing like a weed, it follows that there are two
> issues. Technical - what is the maximum that the current approach supports
> - how long will this last us. Fundamental - what funding is available to
> sustain Wikidata.
>
> For the financial guys, growth like Wikidata is experiencing is not
> something you can reliably forecast. As an organisation we have more money
> than we need to spend, so there is no credible reason to be stingy.
>
> For the technical guys, consider our growth and plan for at least one
> year. When the impression exists that the current architecture will not
> scale beyond two years, start a project to future proof Wikidata.
>
> It will grow and the situation will get worse before it gets better.
> Thanks,
>   GerardM
>
> PS I know about phabricator tickets, they do not give the answers to the
> questions we need to address.
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] About Functional QuadStore for WikiData (was Are we ready for our future)

2019-05-04 Thread Amirouche Boubekki
On Sat, May 4, 2019 at 04:00, Yuri Astrakhan wrote:

> Sounds interesting, is there a github repo?
>

Thanks for your interest. This is still a work in progress. I made a
prototype that demos that the history significance measure allows one to do
time-travelling queries in the v0 branch.
Right now, I am working on getting it all together.

https://github.com/awesome-data-distribution/datae/tree/master/docs/SCHEME20XY#abstract

On Fri, May 3, 2019 at 8:19 PM Amirouche Boubekki <
> amirouche.boube...@gmail.com> wrote:
>
>> GerardM post triggered my interest to post to the mailing list. As you
>> might know I am working on functional quadstore that is quadstore that
>> keeps around old version of data, like a wiki but in direct-acyclic-graph.
>> It only stores differences between commits. It rely on snapshot of the
>> latest version for fast reads. My ultimate goal is to build somekind of
>> portable knowlege base. That is something like WikiBase + blazegraph but
>> that you spinup on regular machine with the press of button.
>>
>> Enought brag about me. I wont't reply to all the message of the threads
>> one by one but:
>>
>> Here is what SHOULD BE possible:
>>
>> - incremental dumps
>> - time traveling queries
>> - full dumps
>> - The federation of wikibase SHOULD BE possible since it stored in a
>> history like GIT and git pull git push are planned in the ROADMAP
>>
>> And online edition of the quadstore.
>>
>> Access Control List are not designed yet, I except that this should be
>> enforced by the application layer.
>>
>> I planned start working on Data Management System (something like CKAN)
>> with search featrure. But I would gadly work with wikimedia instead.
>>
>> Also, given it modeled after git, one can do merge-request like features,
>> ie. exit the massive import that is crippled.
>>
>> What I would need is logs possibly with timing of queries (read and
>> write) to do benchmarks.
>>
>
Request:

   - Is it possible to have logs of read queries done against blazegraph
   with timings?
   - Is it possible to have logs of write queries done against mysql with
   timings?

In the best of worlds, it would be best to have logs to replicate the
workload of the production databases.


> Maybe I should ask for fund at mediawiki?
>>
>
What about this? Any possibility to have my project funded by the
foundation somehow?


>
>> FWIW, I got 2 times faster than blazegraph on microbenchmark.
>>
>>
>>> Hoi,
>>> Wikidata grows like mad. This is something we all experience in the
>>> really bad response times we are suffering. It is so bad that people are
>>> asked what kind of updates they are running because it makes a difference
>>> in the lag times there are.
>>>
>>> Given that Wikidata is growing like a weed, it follows that there are
>>> two issues. Technical - what is the maximum that the current approach
>>> supports - how long will this last us. Fundamental - what funding is
>>> available to sustain Wikidata.
>>>
>>> For the financial guys, growth like Wikidata is experiencing is not
>>> something you can reliably forecast. As an organisation we have more money
>>> than we need to spend, so there is no credible reason to be stingy.
>>>
>>> For the technical guys, consider our growth and plan for at least one
>>> year. When the impression exists that the current architecture will not
>>> scale beyond two years, start a project to future proof Wikidata.
>>>
>>> It will grow and the situation will get worse before it gets better.
>>> Thanks,
>>>   GerardM
>>>
>>> PS I know about phabricator tickets, they do not give the answers to the
>>> questions we need to address.
>>>
>>
>>
>>
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [ANN] nomunofu v0.1.0

2019-12-10 Thread Amirouche Boubekki
On Sun, Dec 8, 2019 at 18:52, Amirouche Boubekki wrote:
>
> I am very pleased to announce the immediate availability of nomunofu.
>
> nomunofu is database server written in GNU Guile that is powered by 
> WiredTiger ordered key-value store.
>
> It allows to store and query triples.  The goal is to make it much easier, 
> definitely faster to query as big as possible tuples of three items.  To 
> achieve that goal, the server part of the database is made very simple and it 
> only knows how to do pattern matching.  Also, it is possible to swap the 
> storage engine to something that is horizontally scalable and resilient.
>

I pushed portable binaries built with GNU Guix for amd64, with a small
database file. You can download them with the following command:

  $ wget https://hyper.dev/nomunofu-v0.1.3.tar.gz

The uncompressed directory is 7GB.

Once you have downloaded the tarball, you can do the following CLI
dance to run the database server:

  $ tar xf nomunofu-v0.1.3.tar.gz && cd nomunofu && ./nomunofu serve 8080

The database will be available on port 8080. Then you can use the
Python client to do queries.

Here is an example run on the current dataset that queries for instances
of (P31) government (Q3624078):

In [1]: from nomunofu import Nomunofu
In [2]: from nomunofu import var
In [3]: nomunofu = Nomunofu('http://localhost:8080');
In [4]: nomunofu.query((var('uid'),
'http://www.wikidata.org/prop/direct/P31',
'http://www.wikidata.org/entity/Q3624078'), (var('uid'),
'http://www.w3.org/2000/01/rdf-schema#label', var('label')))

Out[4]:
[{'uid': 'http://www.wikidata.org/entity/Q31', 'label': 'Belgium'},
 {'uid': 'http://www.wikidata.org/entity/Q183', 'label': 'Germany'},
 {'uid': 'http://www.wikidata.org/entity/Q148', 'label': 'China'},
 {'uid': 'http://www.wikidata.org/entity/Q148',
  'label': "People's Republic of China"},
 {'uid': 'http://www.wikidata.org/entity/Q801', 'label': 'Israel'},
 {'uid': 'http://www.wikidata.org/entity/Q45', 'label': 'Portugal'},
 {'uid': 'http://www.wikidata.org/entity/Q155', 'label': 'Brazil'},
 {'uid': 'http://www.wikidata.org/entity/Q916', 'label': 'Angola'},
 {'uid': 'http://www.wikidata.org/entity/Q233', 'label': 'Malta'},
 {'uid': 'http://www.wikidata.org/entity/Q878',
  'label': 'United Arab Emirates'},
 {'uid': 'http://www.wikidata.org/entity/Q686', 'label': 'Vanuatu'},
 {'uid': 'http://www.wikidata.org/entity/Q869', 'label': 'Thailand'},
 {'uid': 'http://www.wikidata.org/entity/Q863', 'label': 'Tajikistan'},
 {'uid': 'http://www.wikidata.org/entity/Q1049', 'label': 'Sudan'},
 {'uid': 'http://www.wikidata.org/entity/Q1044', 'label': 'Sierra Leone'},
 {'uid': 'http://www.wikidata.org/entity/Q912', 'label': 'Mali'},
 {'uid': 'http://www.wikidata.org/entity/Q819', 'label': 'Laos'},
 {'uid': 'http://www.wikidata.org/entity/Q298', 'label': 'Chile'},
 {'uid': 'http://www.wikidata.org/entity/Q398', 'label': 'Bahrain'},
 {'uid': 'http://www.wikidata.org/entity/Q12560', 'label': 'Ottoman Empire'}]

As of right now, fewer than 10 000 000 triples have been
imported. Blank nodes are included, but only English labels are
imported.

You can find the source code at:

  https://github.com/amirouche/nomunofu

I hope you have a good day!

Amirouche ~ zig ~ https://hyper.dev

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [ANN] nomunofu v0.1.0

2019-12-12 Thread Amirouche Boubekki
I am pleased to share with you the v0.1.4 binary release. It contains
the following improvements:

- The REST API takes JSON as input, which will make it easier to
create clients in other programming languages;

- The REST API takes limit and offset as query string. The maximum
limit is 1000;

- There is better error handling: the server will return an HTTP status
code 400 if it detects an error;

- Add aggregation queries `sum`, `count` and `average`, see the Python
client (nomunofu.py) to know how to properly format the query;

- Python client method `Nomunofu.query(*patterns, limit=None,
offset=None)` returns a generator.

Also, the harmless warnings are silenced. The database files are
compatible with the previous release.

This release comes with the full Wikidata lexeme triples.

You can download the amd64 portable binary release plus database files
with the following command:

  wget http://hyper.dev/nomunofu-v0.1.4.tar.bz2

The directory is 11G uncompressed.

Grab the source code with the following command:

  git clone https://github.com/amirouche/nomunofu

Here is an example Python query that returns at most 5 adverbs:

In [10]: for item in nomunofu.query(
...: (var('uid'), wikibase('lexicalCategory'),
'http://www.wikidata.org/entity/Q380057'),
...: (var('uid'), rdfschema('label'), var('label')),
...: limit=5):
...: print(item)
...:
{'uid': 'http://www.wikidata.org/entity/L3244', 'label': 'always'}
{'uid': 'http://www.wikidata.org/entity/L4124', 'label': 'here'}
{'uid': 'http://www.wikidata.org/entity/L4326', 'label': 'often'}
{'uid': 'http://www.wikidata.org/entity/L5201', 'label': 'too'}
{'uid': 'http://www.wikidata.org/entity/L5321', 'label': 'yet'}



Cheers,



Amirouche ~ zig ~ https://hyper.dev

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] [ANN] nomunofu v0.1.0

2019-12-08 Thread Amirouche Boubekki
I am very pleased to announce the immediate availability of nomunofu.

nomunofu is a database server written in GNU Guile that is powered by
the WiredTiger ordered key-value store.

It allows one to store and query triples.  The goal is to make it much easier,
and definitely faster, to query as big as possible sets of three-item tuples.
To achieve that goal, the server part of the database is made very simple and
it only knows how to do pattern matching.  Also, it is possible to swap the
storage engine for something that is horizontally scalable and resilient.

The client must be smarter, and do whatever is needed to fulfill user
requests. Today's release only includes a minimal Python client.  In the
future, I plan to extend the Python client to fully support SPARQL 1.1.
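
To give an idea of what "the client must be smarter" means in practice, here
is a hedged sketch (not the actual nomunofu client code) of how a client can
evaluate several triple patterns by joining the results of single-pattern
calls, binding variables as it goes; `match_pattern` stands in for one call to
the server, which only knows single-pattern matching:

    # Sketch of a client-side nested-loop join over single-pattern matches.
    # Variables are plain strings like "?s".
    TRIPLES = [
        ("Q31", "instance-of", "country"),
        ("Q31", "label", "Belgium"),
        ("Q142", "instance-of", "country"),
        ("Q142", "label", "France"),
    ]

    def is_var(term):
        return isinstance(term, str) and term.startswith("?")

    def match_pattern(pattern, bindings):
        """Yield extended bindings for one pattern (one 'server' call per binding)."""
        # Substitute variables that are already bound before matching.
        pattern = tuple(bindings.get(t, t) if is_var(t) else t for t in pattern)
        for triple in TRIPLES:
            new = dict(bindings)
            for term, value in zip(pattern, triple):
                if is_var(term):
                    if term in new and new[term] != value:
                        break
                    new[term] = value
                elif term != value:
                    break
            else:
                yield new

    def query(*patterns):
        solutions = [{}]
        for pattern in patterns:
            solutions = [b for s in solutions for b in match_pattern(pattern, s)]
        return solutions

    print(query(("?s", "instance-of", "country"), ("?s", "label", "?label")))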

Preliminary tests over 100 000 and 1 000 000 triples look good. The next
step is to reach 1 billion triples and eventually the 9 billion Wikidata
triples.

You can get the code with the following command:

  git clone https://github.com/amirouche/nomunofu

After the installation of GNU Guix [0], you can do:

  make init && gunzip test.nt.gz && make index && make web

And in another terminal:

  make query

Thanks to Guix, portable binaries for amd64 Ubuntu 18.04 will be made
available in a few weeks; along with this, a Docker image will be built.
The binary release will include Wikidata pre-loaded.

[0] https://guix.gnu.org/download/

Here is an example ipython session:

$ ipython
Python 3.7.3 (default, Oct  7 2019, 12:56:13)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.10.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from nomunofu import Nomunofu


In [2]: from nomunofu import var


In [3]: nomunofu = Nomunofu('http://localhost:8080');


In [4]: nomunofu.query((var('uid'), "http://www.w3.org/2000/01/rdf-schema#label", "Belgium"))

Out[4]: [{'uid': 'http://www.wikidata.org/entity/Q31'}]

In [5]: nomunofu.query((var('uid'), "http://www.w3.org/2000/01/rdf-schema#label", "Belgium"), (var('about'),
   ...: "http://schema.org/about", var('uid')))

Out[5]:
[{'uid': 'http://www.wikidata.org/entity/Q31',
  'about': 'https://www.wikidata.org/wiki/Special:EntityData/Q31'},
 {'uid': 'http://www.wikidata.org/entity/Q31',
  'about': 'https://it.wikivoyage.org/wiki/Belgio'},
 {'uid': 'http://www.wikidata.org/entity/Q31',
  'about': 'https://an.wikipedia.org/wiki/Belchica'},
 {'uid': 'http://www.wikidata.org/entity/Q31',
  'about': 'https://sl.wikipedia.org/wiki/Belgija'},
 {'uid': 'http://www.wikidata.org/entity/Q31',
  'about': 'https://pfl.wikipedia.org/wiki/Belgien'},
 {'uid': 'http://www.wikidata.org/entity/Q31',
  'about': 'https://crh.wikipedia.org/wiki/Bel%C3%A7ika'},
 {'uid': 'http://www.wikidata.org/entity/Q31',
  'about': 'https://fiu-vro.wikipedia.org/wiki/Belgi%C3%A4'},
 {'uid': 'http://www.wikidata.org/entity/Q31',
  'about': 'https://fr.wikipedia.org/wiki/Belgique'}
...

Cheers,

Amirouche ~ zig ~ https://hyper.dev
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Full-text / autocomplete search on labels

2019-10-04 Thread Amirouche Boubekki
Hello all,

On Fri, Oct 4, 2019 at 09:58, Thomas Francart wrote:
>
> Hello
>
> I understand the wikidata SPARQL label service only fetches the labels, but 
> does not allow to search/filter on them; labels are also available in 
> regulare rdfs:label on which a FILTER can be made.

See Ettore Rizza's answer about filtering.

> However I would like to do full-text search over labels, to e.g. feed an 
> autocomplete search field,

I understand what you want to do, but that is not called "full-text
search". FTS means searching "inside the text" or "all of the text", which
does not apply to concept search or wikification.

The most common terms for this kind of search are "fuzzy search",
"spell checking" or "autocomplete". The basic algorithm is to
search terms using prefixes of the input query.
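
A minimal illustration of that prefix-based approach (plain Python over a
sorted list of labels; a real implementation would sit on top of the label
index, and the labels below are made up for the example):

    import bisect

    # A sorted list of (lowercased) labels stands in for the label index.
    LABELS = sorted(["barack obama", "belgium", "bush", "george bush",
                     "george w. bush", "germany"])

    def autocomplete(prefix, limit=5):
        """Return up to `limit` labels starting with `prefix` (case-insensitive)."""
        prefix = prefix.lower()
        start = bisect.bisect_left(LABELS, prefix)
        matches = []
        for label in LABELS[start:]:
            if not label.startswith(prefix):
                break
            matches.append(label)
            if len(matches) == limit:
                break
        return matches

    print(autocomplete("ge"))  # ['george bush', 'george w. bush', 'germany']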

More on that later 

---

Amirouche ~ amz3 ~ https://hyper.dev

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-22 Thread Amirouche Boubekki
Hello all!

On Tue, Dec 17, 2019 at 18:15, Aidan Hogan wrote:
>
> Hey all,
>
> As someone who likes to use Wikidata in their research, and likes to
> give students projects relating to Wikidata, I am finding it more and
> more difficult to (recommend to) work with recent versions of Wikidata
> due to the increasing dump sizes, where even the truthy version now
> costs considerable time and machine resources to process and handle.

Maybe that is a software problem? What tools do you use to process the dump?

> More generally, I think the growing data volumes might inadvertently
> scare people off taking the dumps and using them in their research.
>
> One idea we had recently to reduce the data size for a student project
> while keeping the most notable parts of Wikidata was to only keep claims
> that involve an item linked to Wikipedia; in other words, if the
> statement involves a Q item (in the "subject" or "object") not linked to
> Wikipedia, the statement is removed.

One similar scheme would be to only keep concepts that are part of the
Wikipedia vital articles [0] and their neighbours (to be defined).

[0] https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/5

Related to Wikipedia vital articles, of which I only know the English version:
the problem is that the vital articles are not available in a structured
format.  A few months back I made a proposal to add that information to
Wikidata, but I got no feedback.  There is https://www.wikidata.org/wiki/Q43375360.
Not sure where to go from there.
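
As a rough illustration of the filtering idea from the quoted proposal, here
is a sketch that assumes an N-Triples dump on disk and a pre-computed set of
"notable" Q ids (for example, items with a Wikipedia sitelink, or items on a
vital-articles list); the file names and the NOTABLE set are placeholders:

    import re

    # Pre-computed set of "notable" item ids; hard-coded here for the sketch.
    NOTABLE = {"Q31", "Q142"}

    QID = re.compile(r"http://www\.wikidata\.org/entity/(Q\d+)")

    def keep(line):
        """Keep an N-Triples line only if every Q item it mentions is notable."""
        qids = QID.findall(line)
        return all(q in NOTABLE for q in qids)

    with open("truthy.nt") as dump, open("concise.nt", "w") as out:
        for line in dump:
            if keep(line):
                out.write(line)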

> I wonder would it be possible for Wikidata to provide such a dump to
> download (e.g., in RDF) for people who prefer to work with a more
> concise sub-graph that still maintains the most "notable" parts?

The best thing would be to allow people to create their own lists of vital
Wikidata concepts, similar to how there are custom Wikipedia vital lists, and
taking inspiration from the tool that was released recently.

> While
> of course one could compute this from the full-dump locally, making such
> a version available as a dump directly would save clients some
> resources, potentially encourage more research using/on Wikidata, and
> having such a version "rubber-stamped" by Wikidata would also help to
> justify the use of such a dataset for research purposes.

I agree.

> ... just an idea I thought I would float out there. Perhaps there is
> another (better) way to define a concise dump.
>
> Best,
> Aidan
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata



-- 
Amirouche ~ https://hyper.dev

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [ANN] nomunofu v0.1.0

2019-12-22 Thread Amirouche Boubekki
Hello all ;-)


I ported the code to Chez Scheme to do an apples-to-apples comparison
between GNU Guile and Chez, and took the time to launch a few queries
against the Virtuoso available in Ubuntu 18.04 (LTS).

Spoiler: the new code is always faster.

The hard disk is SATA, and the CPU is dubbed: Intel(R) Xeon(R) CPU
E3-1220 V2 @ 3.10GHz

I imported latest-lexeme.nt (6GB) using guile-nomunofu, chez-nomunofu
and Virtuoso:

- Chez takes 40 minutes to import 6GB
- Chez is 3 to 5 times faster than Guile
- Chez is 11% faster than Virtuoso

Regarding query time, Chez is still faster than Virtuoso with or
without cache.  The query I am testing is the following:

SELECT ?s ?p ?o
FROM <http://fu>
WHERE {
  ?s <http://purl.org/dc/terms/language> <http://www.wikidata.org/entity/Q150> .
  ?s <http://wikiba.se/ontology#lexicalCategory> <http://www.wikidata.org/entity/Q1084> .
  ?s <http://www.w3.org/2000/01/rdf-schema#label> ?o
};

Virtuoso's first query takes: 1295 msec.
The second query takes: 331 msec.
Then it stabilizes around: 200 msec.

Chez nomunofu takes around 200 ms without a cache.

There is still an optimization I can do to speed up nomunofu a little.


Happy hacking!

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Contd: [ANN] nomunofu v0.1.0

2019-12-22 Thread Amirouche Boubekki
On Sun, Dec 22, 2019 at 21:23, Kingsley Idehen wrote:
>
> On 12/22/19 4:17 PM, Kingsley Idehen wrote:
>
> On 12/22/19 3:17 PM, Amirouche Boubekki wrote:
>
> Hello all ;-)
>
>
> I ported the code to Chez Scheme to do an apple-to-apple comparison
> between GNU Guile and Chez and took the time to launch a few queries
> against Virtuoso available in Ubuntu 18.04 (LTS).
>
> Spoiler: the new code is always faster.
>
> The hard disk is SATA, and the CPU is dubbed: Intel(R) Xeon(R) CPU
> E3-1220 V2 @ 3.10GHz
>
> I imported latest-lexeme.nt (6GB) using guile-nomunofu, chez-nomunofu
> and Virtuoso:
>
> - Chez takes 40 minutes to import 6GB
> - Chez is 3 to 5 times faster than Guile
> - Chez is 11% faster than Virtuoso
>
> Regarding query time, Chez is still faster than Virtuoso with or
> without cache.  The query I am testing is the following:
>
> SELECT ?s ?p ?o
> FROM <http://fu>
> WHERE {
>   ?s <http://purl.org/dc/terms/language> 
> <http://www.wikidata.org/entity/Q150> .
>   ?s <http://wikiba.se/ontology#lexicalCategory>
> <http://www.wikidata.org/entity/Q1084> .
>   ?s <http://www.w3.org/2000/01/rdf-schema#label> ?o
> };
>
> Virtuoso first query takes: 1295 msec.
> The second query takes: 331 msec.
> Then it stabilize around: 200 msec.
>
> chez nomunofu takes around 200ms without cache.
>
> There is still an optimization I can do to speed up nomunofu a little.
>
>
> Happy hacking!
>
> If you are going to make claims about Virtuoso, please shed light on
> your Virtuoso configuration and host machine.
>
> How much memory do you have on this machine? What the CPU Affinity re
> CPUs available.

I did not set up CPU affinity... I am not sure what it is.

>
> Is there a URL for sample data used in your tests?

The sample data is
https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2

>
>
> Looking at 
> https://ark.intel.com/content/www/us/en/ark/products/65734/intel-xeon-processor-e3-1220-v2-8m-cache-3-10-ghz.html,
>  your Virtuoso INI settings are even more important due to the fact that we 
> have CPU Affinity of 4 in play i.e., you need configure Virtuoso such that it 
> optimizes behavior for this setup.
>

Thanks a lot for the input. My .ini file is empty. There are 4 CPUs/cores
on the machine I am using, with 32 GB of RAM. What should I do?

Thanks in advance!

> --
> Regards,
>
> Kingsley Idehen
> Founder & CEO
> OpenLink Software
> Home Page: http://www.openlinksw.com
> Community Support: https://community.openlinksw.com
> Weblogs (Blogs):
> Company Blog: https://medium.com/openlink-software-blog
> Virtuoso Blog: https://medium.com/virtuoso-blog
> Data Access Drivers Blog: 
> https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
>
> Personal Weblogs (Blogs):
> Medium Blog: https://medium.com/@kidehen
> Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
>   http://kidehen.blogspot.com
>
> Profile Pages:
> Pinterest: https://www.pinterest.com/kidehen/
> Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
> Twitter: https://twitter.com/kidehen
> Google+: https://plus.google.com/+KingsleyIdehen/about
> LinkedIn: http://www.linkedin.com/in/kidehen
>
> Web Identities (WebID):
> Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
> : 
> http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata



-- 
Amirouche ~ https://hyper.dev

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Fwd: nomunofu, WDQS, and future of wikidata

2020-01-25 Thread Amirouche Boubekki
Hello all,


I would like to know what you think of the proposal I made at:

  https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS

Like I said previously on the Wikidata mailing list, I can address the
following problems related to WDQS:

> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> ref: https://lists.wikimedia.org/pipermail/wikidata/2019-June/013124.html

The other proposal I made is about replacing both wikibase and blazegraph:

   
https://meta.wikimedia.org/wiki/Grants:Project/Iamamz3/Prototype_A_Scalable_WikiData

What do you think?


-- 
Amirouche ~ https://hyper.dev

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status of Wikidata Query Service

2020-02-10 Thread Amirouche Boubekki
Hello Guillaume,

On Fri, Feb 7, 2020 at 14:33, Guillaume Lederrey wrote:
>
> Hello all!
>
> First of all, my apologies for the long silence. We need to do better in 
> terms of communication. I'll try my best to send a monthly update from now 
> on. Keep me honest, remind me if I fail.
>

It would be nice to have some feedback on my grant request at:

  https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS

Or one of the other threads on the very same mailing list.

> Another attempt to get update lag under control is to apply back pressure on 
> edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is 
> obviously less than ideal (at least as long as WDQS updates are lagging as 
> often as they are), but does allow the service to recover from time to time. 
> We probably need to iterate on this, provide better granularity, 
> differentiate better between operations that have an impact on update lag and 
> those which don't.
>
> On the slightly better news side, we now have a much better understanding of 
> the update process and of its shortcomings. The current process does a full 
> diff between each updated entity and what we have in blazegraph. Even if a 
> single triple needs to change, we still read tons of data from Blazegraph. 
> While this approach is simple and robust, it is obviously not efficient. We 
> need to rewrite the updater to take a more event streaming / reactive 
> approach, and only work on the actual changes.

Even when that is done, it will still be a short-term solution.

> This is a big chunk of work, almost a complete rewrite of the updater,

> and we need a new solution to stream changes with guaranteed ordering 
> (something that our kafka queues don't offer). This is where we are focusing 
> our energy at the moment, this looks like the best option to improve the 
> situation in the medium term. This change will probably have some functional 
> impacts [3].

Guaranteed ordering in a multi-party distributed setting has no easy
solution, and Kafka only provides it within a single partition, not
across partitions.  For a non-technical introduction to why this is
hard, see https://en.wikipedia.org/wiki/Two_Generals%27_Problem
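
To make the partition point concrete, here is a minimal sketch, assuming
the kafka-python client and a hypothetical "wikidata-edits" topic (this
is not the actual WDQS updater code), of keying messages by entity ID so
that all edits of one entity stay ordered relative to each other:

    # Sketch only: assumes the kafka-python client and a hypothetical
    # "wikidata-edits" topic; this is not the actual WDQS updater code.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Messages with the same key hash to the same partition, so the two
    # edits of Q42 below keep their relative order; ordering across
    # different entities (partitions) is still not guaranteed.
    producer.send("wikidata-edits", key=b"Q42", value=b"revision 123 ...")
    producer.send("wikidata-edits", key=b"Q42", value=b"revision 124 ...")
    producer.flush()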

> Some longer term thoughts:
>
> Keeping all of Wikidata in a single graph is most probably not going to work 
> long term.

:(

> We have not found examples of public SPARQL endpoints with > 10 B triples and 
> there is probably a good reason for that.

Because Wikimedia is the only non-profit in the field?

> We will probably need to split the graphs at some point.

:(

> We don't know how yet

:(

> (that's why we loaded the dumps into Hadoop, that might give us some more 
> insight).

:(

> We might expose a subgraph with only truthy statements. Or have 
> language-specific graphs, with only language-specific labels.

:(

> Or something completely different.

:)

> Keeping WDQS / Wikidata as open as they are at the moment might not be 
> possible in the long term. We need to think if / how we want to implement 
> some form of authentication and quotas.

With blacklists and whitelists, but that is a big undertaking anyway.

> Potentially increasing quotas for some use cases, but keeping them strict for 
> others. Again, we don't know how this will look like, but we're thinking 
> about it.

> What you can do to help:
>
> Again, we're not sure. Of course, reducing the load (both in terms of edits 
> on Wikidata and of reads on WDQS) will help. But not using those services 
> makes them useless.

What about making the lag part of the service?  I mean, you could
reload WDQS periodically, for instance daily, and drop the updater
altogether.  Who really needs to see edits appear in WDQS as soon as
they are made in wikidata?

> We suspect that some use cases are more expensive than others (a single 
> property change to a large entity will require a comparatively insane amount 
> of work to update it on the WDQS side). We'd like to have real data on the 
> cost of various operations, but we only have guesses at this point.
>
> If you've read this far, thanks a lot for your engagement!
>
>   Have fun!
>

Will do.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status of Wikidata Query Service

2020-02-10 Thread Amirouche Boubekki
On Mon, Feb 10, 2020 at 18:23, Marco Neumann wrote:
>
> why all the sad faces?

> the Semantic Web will be distributed after all

The semantic Web is already distributed.

> and there is no need to stuff everything into one graph.

Putting everything into one graph, or if you prefer into one place, is
the very idea of a library or an encyclopedia.

> it just requires us as an RDF community to spend more time developing ideas 
> around efficient query distribution

Maybe. But that does not preclude the aggregation, the sum of knowledge, from happening.

> and focus on relationships and links in wikidata

Like I wrote above, a distributed knowledge base is already the state
of things.  I am not sure how to interpret that part of the sentence.

> rather than building a monolithic database

That is the gist of my proposal.  Without the ability to run wikidata
at a small scale, WMF will fail at knowledge equity.

> for humongous arbitrary joins and table scans

I proposed something along the lines of
https://linkeddatafragments.org, also known as "thin server, thick
client", but I had no feedback :(

> as a free for all.

With that, I heartily agree.  Being able to downscale the wikidata
infrastructure, and to make companies and institutions pay for the
stream of changes to apply to their local instances, would make things
much easier.

> The slogan "sum of all human knowledge" in one place should not be taken too 
> literally.

I disagree.

>
> it's I believe what wikidata as a project already does in any event, 
> actually, the SPARQL endpoint as an extension to the wikidata architecture 
> around wikibase should be used more pro-actively to connect multiple RDF data 
> providers for search.

Read my proposal at
https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS

The title is misleading; I intended to change it to "Future-proof
WikiData".  WDQS, or querying in general, is an integral part of
wikidata and must not be merely an add-on.

> I would think that this is already a common use case for wikidata users who 
> enrich their remote queries with wikidata data.

I do not understand.  Yes, people enrich wikidata queries with their data. And?

> All that said it's quite an achievement to scale the wikidata SPARQL endpoint 
> to where it is now.
> Congratulations to the team and I look forward to seeing more of it in the 
> future.

Yes, I agree with that.  Congratulations!  I am very proud to be part
of the Wikimedia community.

The current WMF proposal is called "sharding"; see details at:

  https://en.wikipedia.org/wiki/Shard_(database_architecture)

It is not future-proof.  I have not done any analysis, but I bet that
most of the 2TB of wikidata is English, so even if you shard by
language, you will still end up with one gigantic graph.  Also, most of
the data is not specific to a natural language, so one cannot cleanly
split the data by language.

If WMF comes up with another sharding strategy, how will edits that
span multiple regions happen?

How will it make entering the wikidata party easier?

I dare to write it in the open: it seems to me that we are witnessing
an "Earth is flat vs. Earth is not flat" kind of event.


Thanks for the reply!



Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-13 Thread Amirouche Boubekki
On Mon, Jul 13, 2020 at 21:22, Adam Sanchez wrote:
>
> I have 14T SSD  (RAID 0)
>
> On Mon, Jul 13, 2020 at 21:19, Amirouche Boubekki wrote:
> >
> > On Mon, Jul 13, 2020 at 19:42, Adam Sanchez wrote:
> > >
> > > Hi,
> > >
> > > I have to launch 2 million queries against a Wikidata instance.
> > > I have loaded Wikidata in Virtuoso 7 (512 RAM, 32 cores, SSD disks with 
> > > RAID 0).
> > > The queries are simple, just 2 types.
> >
> > How much SSD in Gigabytes do you have?
> >
> > > select ?s ?p ?o {
> > > ?s ?p ?o.
> > > filter (?s = ?param)
> > > }

Can you confirm that the above query is the same as:

SELECT ?p ?o {
  <param> ?p ?o
}

Where <param> is one of the two million parameter values, substituted
directly as the subject.

Also, did you investigate where the bottleneck is? Look into disk
usage and CPU load. glances [0] can provide that information.

Can you run the thread pool on another machine?

A back-of-the-envelope calculation: 2,000,000 queries in 6 hours
(21,600 seconds) means your system achieves roughly 10.8 milliseconds
per query.  AFAIK, that is good.

[0] https://github.com/nicolargo/glances/
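
For comparison, here is a minimal sketch of the per-subject lookup
driven by a thread pool, assuming a local Virtuoso SPARQL endpoint at
http://localhost:8890/sparql (the endpoint URL, pool size and sample
subjects are assumptions; adjust to the actual setup):

    # Sketch only: endpoint URL, pool size and sample subjects are
    # assumptions; adjust to the actual setup.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    ENDPOINT = "http://localhost:8890/sparql"

    def lookup(subject):
        # Bind the constant directly as the subject instead of using FILTER.
        query = "SELECT ?p ?o WHERE { <%s> ?p ?o }" % subject
        response = requests.get(
            ENDPOINT,
            params={"query": query},
            headers={"Accept": "application/sparql-results+json"},
        )
        return response.json()

    subjects = [
        "http://www.wikidata.org/entity/Q1",
        "http://www.wikidata.org/entity/Q2",
    ]  # ... two million of them in the real run

    with ThreadPoolExecutor(max_workers=32) as pool:
        results = list(pool.map(lookup, subjects))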

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-13 Thread Amirouche Boubekki
On Mon, Jul 13, 2020 at 19:42, Adam Sanchez wrote:
>
> Hi,
>
> I have to launch 2 million queries against a Wikidata instance.
> I have loaded Wikidata in Virtuoso 7 (512 RAM, 32 cores, SSD disks with RAID 
> 0).
> The queries are simple, just 2 types.

How much SSD in Gigabytes do you have?

> select ?s ?p ?o {
> ?s ?p ?o.
> filter (?s = ?param)
> }

Is that the same as:

SELECT ?p ?o {
  <param> ?p ?o
}

Where <param> is one of the two million parameter values, substituted
directly as the subject.

> select ?s ?p ?o {
> ?s ?p ?o.
> filter (?o = ?param)
> }
>
> If I use a Java ThreadPoolExecutor, it takes 6 hours.
> How can I speed up the queries processing even more?
>
> I was thinking :
>
> a) to implement a Virtuoso cluster to distribute the queries or
> b) to load Wikidata in a Spark dataframe (since Sansa framework is
> very slow, I would use my own implementation) or
> c) to load Wikidata in a Postgresql table and use Presto to distribute
> the queries or
> d) to load Wikidata in a PG-Strom table to use GPU parallelism.
>
> What do you think? I am looking for ideas.
> Any suggestion will be appreciated.
>
> Best,
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata



-- 
Amirouche ~ https://hyper.dev

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata-tech] nomunofu, WDQS, and future of wikidata

2019-12-18 Thread Amirouche Boubekki
Hello,


I would like to know if you are interested by the proposal I made at:

  https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB

Like I said previously on the wikidata mailing list, I can address the
following problems related to WDQS:

In an ideal world, WDQS should:

* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability

ref: https://lists.wikimedia.org/pipermail/wikidata/2019-June/013124.html

The other proposal I made is about replacing both wikibase and blazegraph:

   
https://meta.wikimedia.org/wiki/Grants:Project/Iamamz3/Prototype_A_Scalable_WikiData

nomunofu is a working prototype I made to micro-benchmark (again) GNU Guile:

  https://github.com/amirouche/nomunofu

What do you think?

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Fwd: nomunofu, WDQS, and future of wikidata

2020-01-25 Thread Amirouche Boubekki
Hello all,


I would like to know what you think of the proposal I made at:

  https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS

Like I said previously on the wikidata mailing list, I can address the
following problems related to WDQS:

> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> ref: https://lists.wikimedia.org/pipermail/wikidata/2019-June/013124.html

The other proposal I made is about replacing both wikibase and blazegraph:

   
https://meta.wikimedia.org/wiki/Grants:Project/Iamamz3/Prototype_A_Scalable_WikiData

What do you think?


-- 
Amirouche ~ https://hyper.dev

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Future-proof WDQS (Was: Re: [Wikidata] [ANN] nomunofu v0.1.0)

2019-12-23 Thread Amirouche Boubekki
Checkout my proposal at
https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS

I started working on a paper (more will follow) that will document and
support my work, see
https://en.wikiversity.org/wiki/WikiJournal_Preprints/Generic_Tuple_Store#Future-proof_WDQS

Happy Holidays ;-)

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Slow loading process

2020-07-03 Thread Amirouche Boubekki
On Thu, Jun 11, 2020 at 11:13, David Causse wrote:
>
> Hi,
>
> did you "munge"[0] the dumps prior to loading them?
> As a comparison, loading the munged dump on a WMF production machine (128G, 
> 32 cores, SSD drives) takes around 8 days.
>
> 0: https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Data_preparation
>

munge.sh can be found at
https://github.com/wikimedia/wikidata-query-deploy/blob/master/munge.sh

The source is available at
https://github.com/wikimedia/wikidata-query-rdf/blob/master/tools/src/main/java/org/wikidata/query/rdf/tool/Munge.java
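
For reference, the invocation documented on the wikitech page looks
roughly like the line below (the paths are placeholders, and the exact
flags should be double-checked against the page David linked):

    # Sketch from memory of the documented usage; verify the flags
    # against the wikitech data-preparation page before running.
    ./munge.sh -f data/latest-all.ttl.gz -d data/split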

>
> On Thu, Jun 11, 2020 at 12:37 AM Denny Vrandečić  wrote:
>>
>> Did you see this?
>>
>> https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/

I read "Total time: ~5.5 days"; that is really impressive. My latest
attempt at loading latest-lexemes.nt (10G uncompressed) into my
triple store took 1 day and required 10G of disk space. I have made
progress, but I am still far from being able to compete with Blazegraph
on that front. I have an idea about some optimizations to do; munge.sh
will help.

>>
>> On Wed, Jun 10, 2020, 12:51 Leandro Tabares Martín 
>>  wrote:
>>>
>>> Dear all,
>>>
>>> I'm loading the whole wikidata dataset into Blazegraph using a High 
>>> Performance Computer. I gave 120 GB RAM and 3 processing cores to the job. 
>>> After almost 24 hours of loading, the "wikidata.jnl" file is only 28 GB in 
>>> size. Initially the process was fast, but as the file grew, the loading 
>>> speed decreased. I realize that only 14 GB of RAM are being used. I already 
>>> implemented the recommendations given in 
>>> https://github.com/blazegraph/database/wiki/IOOptimization Do you have some 
>>> other recommendations to increase the loading speed?

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech