Where are you getting the list of 300k pages from? I want to get a feel for the kinds of queries you're running so that we can optimize the process for you.
On Fri, Mar 13, 2015 at 2:53 PM, Marc Miquel <[email protected]> wrote:

> I load the files "page_titles" and "page_ids" and put them into a
> dictionary. One option I haven't used yet would be putting them into a
> database and INNER JOINing against the pagelinks table to obtain just the
> links for those articles. Still, with a list of 300,000 articles, even
> though that is only about 20% of the database, it is still a lot.
>
> Marc
>
> 2015-03-13 19:51 GMT+01:00 John <[email protected]>:
>
>> Where are you getting your list of pages from?
>>
>> On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <[email protected]>
>> wrote:
>>
>>> Hi John,
>>>
>>> My queries obtain the inlinks and outlinks for the articles in a group
>>> (x). Then I check (in Python) whether they have inlinks and outlinks
>>> from another group of articles. Right now I run one query per article.
>>> I wanted to obtain all the links for group (x) at once and then do this
>>> check, but getting all the links for a group as big as 300,000 articles
>>> would mean around 6 million links. Is it possible to retrieve that
>>> much, or is there a MySQL/RAM limit?
>>>
>>> Thanks.
>>>
>>> Marc
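Below is a minimal sketch of the user-table INNER JOIN approach Marc describes above. It assumes pymysql, a Tool Labs replica credentials file (~/replica.my.cnf), and placeholder names for the host, user database, and table (enwiki.labsdb, s12345__links, my_pages); adjust all of these to your own setup.

# Sketch: load the 300,000 page ids into a user table once, then let
# MySQL do the filtering with an INNER JOIN instead of one query per
# article. Host, database, and table names are placeholders.
import os
import pymysql

conn = pymysql.connect(
    host='enwiki.labsdb',
    db='enwiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'))

with conn.cursor() as cur:
    # A one-column user table holding the article ids of interest.
    cur.execute("CREATE DATABASE IF NOT EXISTS s12345__links")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS s12345__links.my_pages (
            page_id INT UNSIGNED NOT NULL PRIMARY KEY
        )""")
    # Bulk-insert the ids in chunks so no single statement gets huge.
    ids = [int(line) for line in open('page_ids')]
    for i in range(0, len(ids), 10000):
        cur.executemany(
            "INSERT IGNORE INTO s12345__links.my_pages VALUES (%s)",
            [(x,) for x in ids[i:i + 10000]])
    conn.commit()

    # Outlinks for just those pages, fetched in a single query; the
    # join keeps the scan to roughly the 20% of pagelinks that matters.
    cur.execute("""
        SELECT pl.pl_from, pl.pl_namespace, pl.pl_title
        FROM pagelinks AS pl
        INNER JOIN s12345__links.my_pages AS mp
            ON mp.page_id = pl.pl_from""")
    for pl_from, ns, title in cur:
        pass  # check each link against the other group here

Inlinks work the same way, joining on pl_namespace/pl_title against page titles instead of pl_from. If the result is still too big to buffer client-side, the same query can be run through a server-side cursor (see the sketch further down).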
>>> 2015-03-13 19:29 GMT+01:00 <[email protected]>:
>>>
>>>> Today's Topics:
>>>>
>>>>    1. dimension well my queries for very large tables like
>>>>       pagelinks - Tool Labs (Marc Miquel)
>>>>    2. Re: dimension well my queries for very large tables like
>>>>       pagelinks - Tool Labs (John)
>>>>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>>>>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>>>>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>>>>
>>>> ----------------------------------------------------------------------
>>>>
>>>> Message: 1
>>>> Date: Fri, 13 Mar 2015 17:59:09 +0100
>>>> From: Marc Miquel <[email protected]>
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: [Labs-l] dimension well my queries for very large tables like pagelinks - Tool Labs
>>>>
>>>> Hello guys,
>>>>
>>>> I have a question regarding Tool Labs. I am doing research on links,
>>>> and although I know very well what I am looking for, I struggle with
>>>> how to get it efficiently...
>>>>
>>>> I need your opinion because you know the system well, and what is
>>>> feasible and what is not.
>>>>
>>>> Let me explain what I need to do: I have a list of articles in
>>>> different languages, and for each article I need to check its
>>>> pagelinks to see where it points to and what points at it.
>>>>
>>>> Right now I run one query per article id in this list, which ranges
>>>> from 80,000 articles on some Wikipedias to 300,000 or more on others.
>>>> I have to do this several times, and it is very time-consuming
>>>> (several days). I wish I could just count the total number of links
>>>> in each case, but I need to look at some of the links per article.
>>>>
>>>> I was thinking about getting all pagelinks rows and iterating over
>>>> them in Python (the language I use for all of this). That would be
>>>> much faster because I would save all the per-article queries I am
>>>> doing now. But the pagelinks table has millions of rows, and I cannot
>>>> load all of that or MySQL would die. I could buffer, but I have not
>>>> yet tried whether that works.
>>>>
>>>> I am considering creating a personal table in the database with the
>>>> titles and ids, and inner joining to obtain the pagelinks for just
>>>> these 300,000 articles. That way I would retrieve only about 20% of
>>>> the table instead of 100%. It could still be 8M rows or more
>>>> (page_title or page_id, one of the two per row), loaded into Python
>>>> dictionaries and lists. Would that be a problem...? I have no idea
>>>> how much RAM this implies or how much I can use on Tool Labs.
>>>>
>>>> I am totally lost when I run into these problems of scale... I
>>>> thought about writing to the IRC channel, but this seemed too long
>>>> and too specific. Any hint would really help.
>>>>
>>>> Thank you very much!
>>>>
>>>> Cheers,
>>>>
>>>> Marc Miquel
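On the "MySQL would die" worry: the client does not have to buffer the whole result. As a rough estimate, 8 million rows kept as Python ints and strings in dictionaries costs on the order of a gigabyte, so holding everything in memory is borderline. A server-side (unbuffered) cursor avoids that by streaming rows in batches, which is the "buffering" Marc mentions. A sketch of that idea, again assuming pymysql and the same placeholder connection details; the group files ("page_ids", "group_y_titles") are hypothetical names for the two article groups.

# Sketch: stream pagelinks in batches with a server-side cursor so
# memory stays flat, keeping only the links that connect the two groups.
import os
import pymysql
import pymysql.cursors

conn = pymysql.connect(
    host='enwiki.labsdb',
    db='enwiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'),
    cursorclass=pymysql.cursors.SSCursor)

group_x = set(int(line) for line in open('page_ids'))           # source ids
group_y = set(line.strip() for line in open('group_y_titles'))  # target titles

links_x_to_y = []
with conn.cursor() as cur:
    cur.execute("SELECT pl_from, pl_namespace, pl_title FROM pagelinks")
    while True:
        rows = cur.fetchmany(50000)   # one modest batch at a time
        if not rows:
            break
        for pl_from, ns, title in rows:
            # pl_title is VARBINARY on the replicas, so it may arrive
            # as bytes; normalize before comparing.
            if isinstance(title, bytes):
                title = title.decode('utf-8')
            if pl_from in group_x and ns == 0 and title in group_y:
                links_x_to_y.append((pl_from, title))

Note that this still scans the entire table, so end-to-end it is slower than the join above; it just caps memory use.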
>>>> ------------------------------
>>>>
>>>> Message: 2
>>>> Date: Fri, 13 Mar 2015 13:07:20 -0400
>>>> From: John <[email protected]>
>>>> To: Wikimedia Labs <[email protected]>
>>>> Subject: Re: [Labs-l] dimension well my queries for very large tables like pagelinks - Tool Labs
>>>>
>>>> What kind of queries are you doing? Odds are they can be optimized.
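On "odds are they can be optimized": the usual first step, short of the user-table join, is to batch the per-article queries, a few hundred ids per IN (...) clause, so 300,000 round trips become about 600. A sketch under the same assumptions as the earlier examples (conn as created there):

# Sketch: one query per chunk of ids instead of one query per article.
CHUNK = 500
ids = [int(line) for line in open('page_ids')]
outlinks = {}

with conn.cursor() as cur:
    for i in range(0, len(ids), CHUNK):
        chunk = ids[i:i + CHUNK]
        placeholders = ','.join(['%s'] * len(chunk))
        cur.execute(
            "SELECT pl_from, pl_namespace, pl_title FROM pagelinks"
            " WHERE pl_from IN (" + placeholders + ")",
            chunk)
        for pl_from, ns, title in cur.fetchall():
            outlinks.setdefault(pl_from, []).append((ns, title))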
>>>> ------------------------------
>>>>
>>>> Message: 3
>>>> Date: Fri, 13 Mar 2015 17:36:00 +0000
>>>> From: Tim Landscheidt <[email protected]>
>>>> To: [email protected]
>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>
>>>> (anonymous) wrote:
>>>>
>>>> > [...]
>>>>
>>>> > To be clear: I'm not going to make my code proprietary in
>>>> > any way. I just wanted to know whether I'm entitled to ask
>>>> > for the source of every Labs bot ;-)
>>>>
>>>> Everyone is entitled to /ask/, but I don't think you have a
>>>> right to /receive/ the source :-).
>>>>
>>>> AFAIK, there are two main reasons for the clause:
>>>>
>>>> a) WMF doesn't want to have to deal with individual licences
>>>>    that may or may not have the potential for litigation
>>>>    ("The Software shall be used for Good, not Evil"). By
>>>>    requiring OSI-approved, tried-and-true licences, the risk
>>>>    is negligible.
>>>>
>>>> b) Bots and tools running on an infrastructure financed by
>>>>    donors, like contributions to Wikipedia & Co., shouldn't
>>>>    be usable for blackmail. No one should be in a legal
>>>>    position to demand something "or else ..." The perpetuity
>>>>    of OS licences guarantees that everyone can be truly
>>>>    thankful to developers without having to fear that they
>>>>    might otherwise shut down devices, delete content, etc.
>>>>
>>>> But the nice thing about collaboratively developed open
>>>> source software is that it is usually of better quality,
>>>> so clandestine code is often not that interesting.
>>>>
>>>> Tim
>>>>
>>>> ------------------------------
>>>>
>>>> Message: 4
>>>> Date: Fri, 13 Mar 2015 11:52:18 -0600
>>>> From: Ryan Lane <[email protected]>
>>>> To: Wikimedia Labs <[email protected]>
>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>
>>>> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <[email protected]> wrote:
>>>>
>>>> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
>>>> > (verbatim): "Do not use or install any software unless the software
>>>> > is licensed under an Open Source license".
>>>> > What about tools and services made up of software themselves? Do
>>>> > they have to be Open Source?
>>>> > Strictly speaking, do the Terms of use require that all code be made
>>>> > available to the public?
>>>> > Thanks in advance.
>>>>
>>>> As the person who wrote the initial terms and included this, I can
>>>> speak to the spirit of the term (I'm not a lawyer, so I won't try to
>>>> go into any legal issues).
>>>>
>>>> I created Labs with the intent that it could be used as a mechanism
>>>> to fork the projects as a whole, if necessary. A means to this end
>>>> was including non-WMF employees in the process of infrastructure
>>>> operations (which is outside the goals of the tools project in Labs).
>>>> Tools and services that can't be distributed publicly harm that goal;
>>>> tools and services that aren't open source completely break it. It's
>>>> fine if you wish not to maintain the code in a public git repo, but
>>>> if another tool maintainer wishes to publish your code, there should
>>>> be nothing blocking that.
>>>>
>>>> Depending on external closed-source services is a debatable topic. I
>>>> know in the past we've decided to allow it. It goes against the
>>>> spirit of the project, but it doesn't require us to distribute
>>>> closed-source software in the case of a fork.
>>>>
>>>> My personal opinion is that your code should be in a public
>>>> repository to encourage collaboration. As the terms are written,
>>>> though, your code is required to be open source, and any libraries it
>>>> depends on must be as well.
>>>>
>>>> - Ryan
>>>> ------------------------------
>>>>
>>>> Message: 5
>>>> Date: Fri, 13 Mar 2015 11:29:47 -0700
>>>> From: Pine W <[email protected]>
>>>> To: Wikimedia Labs <[email protected]>
>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>
>>>> Question: are there heightened security or privacy risks posed by
>>>> having non-open-source code running in Labs?
>>>>
>>>> Is anyone proactively auditing Labs software for open source
>>>> compliance, and if not, should this be done?
>>>>
>>>> Pine
>>>>
>>>> ------------------------------
>>>>
>>>> End of Labs-l Digest, Vol 39, Issue 13
>>>> **************************************
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
